Preface


Welcome to Anaheim and to the 1997 USENIX Conference! 


On behalf of the USENIX Association, thank you for coming to the conference. We have a lot going on here this 
year: the refereed paper tracks, invited talks, works-in-progress and guru-is-in sessions, and two full days of timely 
tutorials. New at the 1997 conference is the co-location of and joint registration with the USELINUX conference. 
We think you’ll find plenty of interesting sessions here to keep you busy the whole week! 


We have chosen 23 papers for the refereed track this year. They cover topics such as filesystems, networking, net- 
worked systems, programming tools, user tools, and performance. Please extend my thanks to the 155 authors from 
37 universities, 13 companies, and 11 countries for their 74 paper submissions for the technical track. Without their 
papers describing their recent efforts, we would not have a refereed track to present. Some of the papers not
accepted for the refereed track will appear informally, in the works-in-progress session or maybe even in a
birds-of-a-feather gathering.


There are many people involved in preparing a USENIX conference, many more than space here can mention. 
Those of special note include the wonderful USENIX staff I worked with: Ellie Young, Judy DesHarnais, Zanna
Knight, and Pennfield Jensen—they were always glad to offer me advice, direction, and able assistance. Dan Geer, 
my USENIX board liaison, kept a watchful eye and was always ready to listen. Special thanks go to Mary Baker and 
Berry Kercheval for arranging the invited talks sessions; to Dan Klein who organized the fantastic tutorial selection; 
to Margo Seltzer and Vera Gropper at Harvard University for hosting the program committee meeting; to Keith 
Smith who served as the Program Committee’s scribe; and to my employer, Pure Atria, for supporting my work as
chair. 


Finally, I would like to thank my program committee (10 hardy souls) and the external reviewers (listed separately) 
for their long hours reading papers and writing detailed reviews last summer. The program committee provided 
feedback to the authors of each submission, drawing from the 5 or more reviews of each paper. We've got a fine 
technical track this year due to the hard work and hard decisions of these volunteers. 


John Kohl, Program Chair


Embedded Inodes and Explicit Grouping:
Exploiting Disk Bandwidth for Small Files 


Gregory R. Ganger and M. Frans Kaashoek 
M.I.T. Laboratory for Computer Science 
Cambridge MA 02139, USA 
{ganger, kaashoek}@lcs.mit.edu 
http://www.pdos.lcs.mit.edu/ 


Abstract 


Small file performance in most file systems is limited by 
slowly improving disk access times, even though cur- 
rent file systems improve on-disk locality by allocat- 
ing related data objects in the same general region. 
The key insight for why current file systems perform 
poorly is that locality is insufficient — exploiting disk 
bandwidth for small data objects requires that they be 
placed adjacently. We describe C-FFS (Co-locating 
Fast File System), which introduces two techniques, 
embedded inodes and explicit grouping, for exploit- 
ing what disks do well (bulk data movement) to avoid 
what they do poorly (reposition to new locations). With 
embedded inodes, the inodes for most files are stored 
in the directory with the corresponding name, remov- 
ing a physical level of indirection without sacrificing the 
logical level of indirection. With explicit grouping, the 
data blocks of multiple small files named by a given di- 
rectory are allocated adjacently and moved to and from 
the disk as a unit in most cases. Measurements of our 
C-FFS implementation show that embedded inodes and 
explicit grouping have the potential to increase small 
file throughput (for both reads and writes) by a factor 
of 5-7 compared to the same file system without these 
techniques. The improvement comes directly from re- 
ducing the number of disk accesses required by an order 
of magnitude. Preliminary experience with software- 
development applications shows performance improve- 
ments ranging from 10-300 percent. 


1 Introduction 


It is frequently reported that disk access times have not 
kept pace with performance improvements in other sys- 
tem components. However, while the time required to 
fetch the first byte of data is high (i.e., measured in
milliseconds), the subsequent data bandwidth is reasonable
(> 10 MB/second). Unfortunately, although file systems 
have been very successful at exploiting this bandwidth 
for large files [Peacock88, McVoy91, Sweeney96], they 
have failed to do so for small file activity (and the cor- 
responding metadata activity). Because most files are 
small (e.g., we observe that 79% of all files on our file 
servers are less than 8 KB in size) and most files ac- 
cessed are small (e.g., [Baker91] reports that 80% of 
file accesses are to files of less than 10 KB), file system
performance is often limited by disk access times rather 
than disk bandwidth. 

One approach often used in file systems like the fast 
file system (FFS) [McKusick84] is to place related data 
objects (e.g., an inode and the data blocks it points to) 
near each other on disk (e.g., in the same cylinder group) 
in order to reduce disk access times. This approach can 
successfully reduce the seek time to just a fraction of that 
for a random access pattern. Unfortunately, it has some 
fundamental limitations. First, it affects only the seek 
time component of the access time¹, which generally
comprises only about half of the access time even for 
random access patterns. Rotational latency, command 
processing, and data movement, which are not reduced 
by simply placing related data blocks in the same gen- 
eral area, comprise the other half. Second, seek times 
do not drop linearly with seek distance for small dis- 
tances. Seeking a single cylinder (or just switching 
between tracks) generally costs a full millisecond, and 
this cost rises quickly for slightly longer seek distances 
[Worthington95]. Third, it is successful only when no
other activity moves the disk arm between related re- 
quests. As a result, this approach is generally limited 
to providing less than a factor of two improvement in 
performance (and often much less). 

¹ We use the terms access time and service time interchangeably to
refer to the time from when the device driver initiates a read or write
request to when the request completion interrupt occurs.

Another approach, the log-structured file system
(LFS), exploits disk bandwidth for all file system data,
including large files, small files, and metadata. The 
idea is to delay, remap and cluster all modified blocks, 
only writing large chunks to the disk [Rosenblum92]. 
Assuming that free extents of disk blocks are always 
available, LFS works extremely well for write activ- 
ity. However, the design is based on the assumption 
that file caches will absorb all read activity and does 
not help in improving read performance. Unfortunately, 
anecdotal evidence, measurements of real systems (e.g., 
[Baker91]), and simulation studies (e.g., [Dahlin94]) all
indicate that main memory caches have not eliminated 
read traffic. 

This paper describes the co-locating fast file system 
(C-FFS), which introduces two techniques for exploiting 
disk bandwidth for small files and metadata: embedded 
inodes and explicit grouping. Embedding inodes in the 
directory that names them (unless multiple directories do
so), rather than storing them in separate inode blocks,
removes a physical on-disk level of indirection without 
sacrificing the logical level of indirection. This tech- 
nique offers many advantages: it halves the number of 
blocks that must be accessed to open a file; it allows the 
inodes for all names in a directory to be accessed with- 
out requesting additional blocks; it eliminates one of 
the ordering constraints required for integrity during file 
creation and deletion; it eliminates the need for static 
(over-)allocation of inodes, increasing the usable disk 
capacity [Forin94]; and it simplifies the implementation 
and increases the efficiency of explicit grouping (there 
is a synergy between these two techniques). 

Explicit grouping places the data blocks of multiple 
files at adjacent disk locations and accesses them as a 
single unit most of the time. To decide which small 
files to co-locate, C-FFS exploits the inter-file relation- 
ships indicated by the name space. Specifically, C-FFS 
groups files whose inodes are embedded in the same di- 
rectory. The characteristics of disk drives have reached 
the point that accessing several blocks rather than just 
one involves a fairly small additional cost. For exam- 
ple, even assuming minimal seek distances, accessing 
16 KB requires only 10% longer than accessing 8 KB, 
and accessing 64 KB requires less than twice as long 
as accessing a single 512-byte sector. Further, the rela- 
tive cost of accessing more data has been dropping over 
the past several years and should continue to do so. As 
a result, explicit grouping has the potential to improve 
small file performance by an order of magnitude over 
conventional file system implementations. Because the 
incremental cost is so low, grouping will improve per- 
formance even when only a fraction of the blocks in a 
group are needed. 

[Figure 1: on-disk layout diagrams; surviving panel label: "D. Explicit Grouping"]

Figure 1: Organization and layout of file data on disk.
This figure shows the on-disk locations of directory
blocks (marked ‘D’), inode blocks (‘I’) and the data
for five single-block files (‘F1’ to ‘F5’) in four different
scenarios: (A) the ideal conventional layout, (B) a more
realistic conventional layout, (C) with the addition of
embedded inodes, and (D) with both embedded inodes
and explicit grouping (with a maximum group size of
four blocks).

Figure 1 illustrates the state of the art and the
improvements made by our techniques. Figure 1A shows
the ideal layout of data and metadata for five
single-block files, which might be obtained if one uses a fresh
FFS partition. In this case, the inodes for all of the files 
are located in the same inode block and the directory 
block and the five file blocks are stored adjacently. With 
this layout, the prefetching performed by most disks will 
exploit the disk’s bandwidth for reads and scatter/gather 
I/O from the file cache can do so for writes. Unfortu- 
nately, a more realistic layout of these files for an FFS 
file system that has been in use for a while is more like 
that shown in Figure 1B. Reading or writing the same 
set of files will now require several disk accesses, most 
of which will require repositioning (albeit with limited 
seek distances, since the picture shows only part of a 
single cylinder group). With embedded inodes, one gets 
the layout shown in Figure 1C, wherein the indirection 
between on-disk directory entries and on-disk inodes is 
eliminated. Finally, with both embedded inodes and explicit
grouping, one gets the layout shown in Figure 1D.
In this case, one can read or write all five files with
two disk accesses. Further, for files F1 and F2, one
can read or write the name, inode and data block all 
in a single disk request. In our actual implementation, 
the maximum group size is larger than that shown and 
the allocation code would try to place the second group 
immediately after the first. 

We have constructed a C-FFS implementation that 
includes both of these techniques. Measurements of 
C-FFS as compared to the same file system without these 
techniques show that, for small file activity, embedded 
inodes and explicit grouping can reduce the number of 
disk accesses by more than an order of magnitude. On 
the system under test (a modern PC), this translates into 
a performance increase of a factor of 5-7 in small file
throughput for both reads and writes. Although our 
evaluation is preliminary, experiments with actual appli- 
cations show performance improvements ranging from 
10-300 percent. 

The remainder of this paper is organized as follows: 
Section 2 provides more detailed motivation for co- 
location, by exploring the characteristics of modern disk 
drives and file storage usage. Section 3 describes our im- 
plementation of embedded inodes and explicit grouping. 
Section 4 shows that co-location improves performance 
significantly by comparing two file system implementa- 
tions that differ only in this respect. Section 5 discusses 
related work. Section 6 discusses some open questions. 
Section 7 summarizes this paper. 


2 Motivation 


The motivating insights for this work fall into two broad 
categories: (1) The performance characteristics of mod- 
ern disk drives, the de facto standard for on-line stor- 
age, force us to aggressively pursue adjacency of small 
objects rather than just locality. (2) The usage and or- 
ganizational characteristics of popular file systems both 
suggest the logical relationships that can be exploited 
and expose the failure of current approaches in placing 
related data adjacently. 


2.1 Modern Disk Drive Characteristics 


It has repeatedly been pointed out that disk drive access
times continue to fall behind relative to other system 
components. However, disk drive manufacturers have 
not been idle. They have matched the improvement 
rates of other system components in areas other than the 
access time, such as reliability, cost per byte, and record- 
ing density. Also, although it has not improved quite as 
rapidly, bulk data bandwidth has improved significantly. 


This section uses characteristics of modern disk drives 
to show that reading or writing several 4KB or 8KB disk 
blocks costs a relatively small amount more than read- 
ing or writing only one (e.g., 10% for 8KB extra and 
100% for 56KB extra), and that this incremental cost is
dropping. Because of the high cost of accessing a single 
block, it makes sense to access several even if only some
of them are likely to be necessary. Readers who do not
need to be convinced are welcome to skip to Section 2.2.

The service time for a disk request can be broken 
into two parts, one that is dependent on the amount of 
data being transferred and one that is not. The former 
usually consists of just the media transfer time (unless 
the media transfer occurs separately, as with prefetch or 
write-behind). Most disks overlap the bus transfer time 
with the positioning and media transfer times, such that 
what remains, the ramp-up and ramp-down periods, is 
independent of the size. Most of the service time com- 
ponents, most notably including command processing 
overheads, seek times, and rotational latencies, are inde- 
pendent of the request size. With modern disk drives and 
small requests, the size-independent parts of the service 
time dominate the size-dependent part (e.g., > 90% of 
the total for random 8KB requests). 
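
The split between size-independent and size-dependent costs can be illustrated with a small model. The following is a minimal sketch, not the calculation used for the figures in this paper: it assumes zero command overhead, an average rotational latency of half a revolution, and no head or track switches during the transfer. The function names are hypothetical, and the default parameters (7200 RPM, 173 sectors per track, 8.7 ms average seek) are taken from one of the drives in Table 1.

```python
SECTOR_BYTES = 512  # sector size assumed throughout


def service_time_ms(request_bytes, seek_ms, rpm=7200, sectors_per_track=173):
    """Estimate service time as seek + average rotational latency + media transfer.

    Simplifying assumptions: no command overhead, no head/track switches.
    """
    rotation_ms = 60_000.0 / rpm                 # time for one revolution
    latency_ms = rotation_ms / 2.0               # average rotational latency
    sectors = -(-request_bytes // SECTOR_BYTES)  # ceiling division
    transfer_ms = rotation_ms * sectors / sectors_per_track
    return seek_ms + latency_ms + transfer_ms


def size_dependent_fraction(request_bytes, seek_ms, **disk):
    """Fraction of the service time that depends on the request size."""
    total = service_time_ms(request_bytes, seek_ms, **disk)
    fixed = service_time_ms(0, seek_ms, **disk)
    return (total - fixed) / total


if __name__ == "__main__":
    for kb in (8, 16, 64):
        t = service_time_ms(kb * 1024, seek_ms=8.7)
        frac = size_dependent_fraction(kb * 1024, seek_ms=8.7)
        print(f"{kb:3d} KB: {t:6.2f} ms, size-dependent share {frac:.1%}")
```

Under these assumptions the transfer term stays a small share of the total for an 8KB request with an average seek, which is the effect described above; the exact percentages depend on the overheads that the sketch leaves out.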

Another way to view the same two aspects of the
service time is as per-request and per-byte costs. The dominating
performance characteristic of modern disk drives
is that the per-request cost is much larger than the per- 
byte cost. Therefore, transferring larger quantities of 
useful data in fewer requests will result in a significant 
performance increase. 

One approach currently used by many file system im- 
plementors is to try to place related data items close to 
each other on the disk, thereby reducing the per-request 
cost. This approach does improve performance, but not 
nearly as much as one might hope. It is generally lim- 
ited to reducing seek distances, and thereby seek times, 
which represent only about half of the per-request time. 
Rotational latency, command processing and data move- 
ment comprise the other half. In addition, even track 
switches and single-cylinder seeks require a significant 
amount of time (e.g., a millisecond or more) because 
they still involve mechanical movement and settling de- 
lays. Further, this approach is successful only when no 
other activity uses the disk (and thereby moves the disk 
arm) between related requests. As a result, this approach
is generally limited to providing less than a factor of two
improvement in performance (and in practice much less).

Disk Drive           Hewlett-Packard     Seagate               Quantum
Specification        C3653a              Barracuda ST19171FC   Atlas II
Capacity             8.7 GB              9.1 GB                [illegible]
Cylinders            5371                5333                  [illegible]
Surfaces             20                  20                    [illegible]
Sectors per Track    124-173             153-239               [illegible]
Rotation Speed       7200 RPM            7200 RPM              7200 RPM
Head Switch          < 1 ms              N/A                   N/A
One-cyl. Seek        < 1 ms              0.6 ms (0.5 ms)       1.0 ms
Average Seek         8.7 ms (0.8 ms)     8.0 ms (1.5 ms)       7.9 ms
Maximum Seek         16.5 ms (1.0 ms)    19.0 ms (1.0 ms)      18.0 ms

Table 1: Characteristics of three modern disk drives, taken from [HP96, Seagate96, Quantum96]. N/A indicates
that the information was not available. For the seek times, the additional time needed for write settling is shown in
parentheses, if it was available.

[Figure 2 panels: (a) HP C3653a, (b) Seagate Barracuda ST19171FC, (c) Quantum Atlas II]

Figure 2: Average service time as a function of request size. The X-axis uses log scale. Two of the lines in each
plot represent the access time assuming an average seek time and a single-cylinder seek time, respectively. The other
two lines in each plot (labelled “immed”) represent the same values, but assuming that the disk utilizes a technique
known as immediate or zero-latency access. The basic idea is to read or write the data sectors in the order that they
pass under the read/write head rather than in strictly ascending order. This can eliminate part or all of the rotational
latency aspect of the service time. These values were calculated from the data in Table 1 using the read seek times
(which are shorter), assuming zero command initiation/processing overheads, and using the sector/track value for
the outermost zone.

To help illustrate the above claims, Table 1 lists
characteristics for three state-of-the-art (for 1996) disk drives
[HP96, Quantum96, Seagate96]. Figure 2 shows, for the
same three drives, average access times as a function of
the request size. Several points can be inferred from
these graphs. First, the incremental cost of reading or
writing several blocks rather than just one is small. For
example, a 16 KB access takes less than 10% longer than
an 8 KB access, even assuming a minimal seek time. A
64 KB access takes less than twice as long as an 8 KB access.
Second, immediate or zero-latency access extends
the range of request sizes that can be accessed for small
incremental cost, which will increase the effectiveness
of techniques that co-locate multiple small data objects.
Third, the seek time for random requests represents a
little more than half the total service time. Therefore,
eliminating it completely for a series of requests can
halve each service time after the first. In comparison,
co-location will eliminate the entire service time for a
set of requests after paying only a slightly higher cost
for the first. So, while reducing seek times can improve
performance somewhat, aggressive co-location has the 
potential to provide much higher returns. 
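
The comparison between reducing seek times and co-locating data can be worked through numerically. The sketch below is an illustration under stated assumptions, not a measurement: it reuses a simple seek + rotational latency + transfer model with no command overhead, takes 7200 RPM, 173 sectors per track, an 8.7 ms average seek and a 1.0 ms short seek as assumed parameters, and the strategy names are hypothetical labels for random placement, locality-based placement, and a single grouped request.

```python
ROTATION_MS = 60_000.0 / 7200      # one revolution at 7200 RPM
LATENCY_MS = ROTATION_MS / 2.0     # average rotational latency
SECTORS_PER_TRACK = 173            # outermost zone (assumed)
AVERAGE_SEEK_MS = 8.7              # assumed average seek
SHORT_SEEK_MS = 1.0                # assumed single-cylinder seek


def request_ms(kbytes, seek_ms):
    """Service time of one request of `kbytes` KB after a seek of `seek_ms`."""
    sectors = kbytes * 2           # two 512-byte sectors per KB
    transfer_ms = ROTATION_MS * sectors / SECTORS_PER_TRACK
    return seek_ms + LATENCY_MS + transfer_ms


def total_ms(n_files, file_kb, strategy):
    """Time to read n_files files of file_kb KB each under a placement strategy."""
    if strategy == "random":       # every file costs an average seek
        return n_files * request_ms(file_kb, AVERAGE_SEEK_MS)
    if strategy == "local":        # one average seek, then short seeks
        return (request_ms(file_kb, AVERAGE_SEEK_MS)
                + (n_files - 1) * request_ms(file_kb, SHORT_SEEK_MS))
    if strategy == "grouped":      # one request fetches all files
        return request_ms(n_files * file_kb, AVERAGE_SEEK_MS)
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    base = total_ms(8, 8, "random")
    for name in ("random", "local", "grouped"):
        t = total_ms(8, 8, name)
        print(f"{name:8s} {t:7.2f} ms  speedup {base / t:.2f}x")
```

With these assumed parameters and eight 8 KB files, locality alone stays just under a factor of two over random placement, while the grouped request is more than five times faster, in line with the argument above.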

Not only are per-byte costs small relative to per-request
costs, but they have been decreasing and are
likely to continue to do so. For example, the HP C2247 
disk drive [HP91, HP92] of a few years ago had only 
half as many sectors on each track as the HP C3653 
listed in Table 1, but an average access time that was 
only 33% higher. As a result, a request of 64KB, which 
takes only 2 times as long as a request of 8 KB on state- 
of-the-art disks, took nearly 3 times as long as a request 
of 8KB on older drives. Projecting this trend into the fu- 
ture suggests that co-location will become increasingly 
important. 


2.2 File System Characteristics 


File systems use metadata to organize raw storage capac- 
ity into human-readable name spaces. Most employ a 
hierarchical name space, using directory files to translate 
components of a full file name to an identifier for either 
the desired file or the directory with which to translate 
the next name component. At each step, the file system
uses the identifier (e.g., an inode number in UNIX file 
systems) to determine the location of the file’s meta- 
data (e.g., an inode). This metadata generally includes 
a variety of information about the file, such as the last 
modification time, the length, and pointers to where the 
actual data blocks are stored. This organization involves 
several levels of indirection between a file name and the 
corresponding data, each of which can result in a sepa- 
rate disk access. Additionally, the levels of indirection 
generally cause each file’s data to be viewed as an inde- 
pendent object (i.e., logical relationships between it and 
other files are only loosely considered). 

To get a better understanding of real file storage or- 
ganizations, we constructed a small utility to scan some 
of our group’s file servers. The two SunOS 4.1 servers 
scanned supply 13.8 GB of storage from 9 file systems 
on 5 disks. At the time of the examination, 9.8 GB (71% 
of the total available) was allocated to 466,271 files. 

In examining the statistics collected, three observa- 
tions led us to explore embedded inodes and explicit 
grouping. First, as shown in Figure 3 as well as by 
previous studies, most files are small (i.e., 79% of those
on our server are smaller than a single 8KB block). In 
addition to this static view of file size distributions, stud- 
ies of dynamic file system activity have reported similar 
behavior. For example, [Baker91] states that, although
most of the bytes accessed are in large files, “the vast 
majority of file accesses are to small files” (e.g., about 
80% of the files accessed during their study were less 
than 10 KB). These data, combined with the fact that
file system implementors are very good at dealing with
large files, suggest that it is important to address the
performance of small file access.

Figure 3: Distribution of file sizes measured on our file
servers.


Second, despite the fact that each inode block con- 
tains 64 inodes, multiple inode blocks are often used to 
hold the metadata for files named in a particular direc- 
tory (see Figure 4). On average, we found that the ratio 
of names in a directory to referenced inode blocks is 
roughly six to one. That is, every sixth name translates 
to a different inode block. This suggests that the cur- 
rent mechanism for choosing where to place an inode 
(i.e., inode allocation) does a poor job of placing related 
inodes adjacently. With embedded inodes, we store in- 
odes in directories rather than pointers to inodes, except 
in the rare case (less than 0.1 percent, on our server) of 
having multiple links to a file from separate directories. 
In addition to eliminating a physical level of indirec- 
tion, embedding inodes reduces the number of blocks 
that contain metadata for the files named by a particular 
directory. 
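
The ratio of names to referenced inode blocks can be computed from a directory scan along the following lines. This is a minimal sketch of one way to do it, not the utility we used: it assumes that an inode's block is found by integer division of its inode number by the 64 inodes per block, represents each directory simply as a list of inode numbers, and aggregates over all directories; the function names are hypothetical.

```python
INODES_PER_BLOCK = 64  # each inode block holds 64 inodes


def inode_blocks_referenced(inode_numbers):
    """Return the set of inode-block indices touched by one directory's entries.

    Assumes inode number // INODES_PER_BLOCK identifies the inode block.
    """
    return {ino // INODES_PER_BLOCK for ino in inode_numbers}


def names_per_inode_block(directories):
    """Ratio of names to referenced inode blocks, summed over all directories."""
    names = sum(len(d) for d in directories)
    blocks = sum(len(inode_blocks_referenced(d)) for d in directories)
    return names / blocks if blocks else 0.0


if __name__ == "__main__":
    # One directory whose six names are spread over three inode blocks.
    print(names_per_inode_block([[0, 1, 2, 64, 65, 200]]))
```

With embedded inodes the corresponding quantity is the number of directory blocks themselves, since no separate inode blocks are referenced for singly linked files.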


Figure 4: Distributions of the number of entries per directory and the corresponding number of inode blocks referred
to on our file servers.

Third, despite the fact that the allocation algorithm
for single-block files in a directory might be expected
to achieve an ideal layout (as in Figure 1A) on a brand
new file system, these blocks tend to be local (e.g., in
the same cylinder group) but not adjacent after the file
system has been in use. For example, looking at entries
in their physical order in the directory, we find that only
30% of directories are adjacent to the first file that they
name. Further, fewer than 45% of files are placed adjacent
to the file whose name precedes their name in the
directory.² The consequence of this lack of adjacency
is that accessing multiple files in a directory involves 
disk head movement for every other file (when the file 
cache does not capture the accesses). The lack of ad- 
jacency occurs because, for the first block of a file, the 
file system always begins looking for free space at the 
beginning of the cylinder group. With explicit grouping, 
new file blocks for small files will be placed adjacently 
to other files in the same directory whenever possible. 
This should represent a substantial reduction in the num- 
ber of disk requests required for a set of files named by 
a given directory. 
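
The adjacency statistic can be made concrete with a short sketch. It is illustrative only: it assumes that each file is described by a hypothetical (start block, block count) pair, and that a file counts as adjacent when it starts immediately after the last block of its predecessor in the chosen ordering. The exact definition used in our scan may differ.

```python
def fraction_adjacent(files):
    """Fraction of files that start right after their predecessor's last block.

    `files` is a list of (start_block, n_blocks) pairs in the order of interest,
    for example the physical order of the directory entries.
    """
    if len(files) < 2:
        return 0.0
    adjacent = sum(
        1
        for (prev_start, prev_len), (start, _) in zip(files, files[1:])
        if start == prev_start + prev_len
    )
    return adjacent / (len(files) - 1)


def best_case_fraction(files):
    """Same measure with files sorted by ascending disk block address."""
    return fraction_adjacent(sorted(files))


if __name__ == "__main__":
    entries = [(20, 1), (10, 1), (11, 1), (21, 2), (23, 1)]
    print(fraction_adjacent(entries), best_case_fraction(entries))
```

Sorting by block address gives an upper bound for a given layout, since no other access order can place more files directly after their predecessors.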

² Different applications access files named by a directory in different
orders. For example, many shell utilities re-order file listings
alphabetically, while multi-file compile and link programs often apply
a user-specified ordering. As a result, we also looked at file adjacency
under alternative file sequences. For example, in alphabetical order,
only 16% of directories are adjacent to their first file and only 30% of
files are adjacent to their predecessor. Even with the best-case ordering
(by ascending disk block address), only 64% of files are next to their
predecessor.

We believe that the relationships inherent in the name
space can be exploited to successfully realize the
performance potential of modern disks’ bandwidths. For
example, many applications examine the attributes (i.e.,
reference the inodes) of files named by a given directory
soon after scanning the directory itself (e.g., revision
control systems that use modification times to identify
files in the source tree that might have changed and directory
listing programs that format the output based on
permissions). Such temporal locality can be exploited 
directly with embedded inodes. Similarly, it is common 
for applications to access multiple files in a directory 
(e.g., when scanning a set of files for a particular string 
or when compiling and linking a multi-file program), as 
opposed to accessing files randomly strewn throughout 
the name space. Therefore, explicitly grouping small 
files in a directory represents a significant performance 
advantage. 


3 Design and Implementation 


To better exploit the data bandwidth provided by modern 
disk drives, C-FFS uses embedded inodes and explicit 
grouping to co-locate related small data objects. This 
section describes the various issues that arise when ap- 
plying these techniques and how our C-FFS implemen- 
tation addresses them. 


3.1 Embedded Inodes 


Conceptually, embedding inodes in directories is 
straightforward. One simply eliminates the previous 
inode storage mechanism and replaces the inode pointer 
generally found in each directory entry with the in- 
ode itself. Unfortunately, a few complications do arise: 
(1) Finding the location of an arbitrary inode given an
inode number (and to avoid changing other file system
components, this file ID must remain unique) must still
be both possible and efficient. (2) As part of enabling
the discovery of an arbitrary inode’s location, the current
C-FFS implementation replaces inode numbers with
inode locators that include an ID for the containing
directory. One consequence of this approach is that when
a file is moved from one directory to another, its inode 
locator will change. This can cause difficulties for sys- 
tem components that are integrated with the file system 
and rely on inode number constancy. (3) Although files 
with links from multiple directories are rare, they are 
useful and should still be supported. Also, files with 
multiple links from within a directory should be sup- 
ported with low cost, since some applications (such as 
editors) use such links for backup management. (4) File 
system recovery after both system crashes and partial 
media corruption should be no less feasible than with a 
conventional file system organization. 

One of our implementation goals (which we almost
achieved) was to change nothing beyond the
directory entry manipulation and on-disk inode access 
portions of the file system. This section describes how 
C-FFS implements embedded inodes, including how 
C-FFS addresses each of the above issues. 


Finding specific inodes 


By eliminating the statically allocated, directly indexed 
set of inodes, C-FFS breaks the direct translation be- 
tween inode number and on-disk inode location. To 
regain this, C-FFS replaces conventional inode numbers 
with inode locators and adds an additional on-disk data 
structure, called the directory map. In our current im- 
plementation, an inode locator has three components: a 
directory number, a sector number within the directory 
and an identifier within the sector. C-FFS currently sizes 
these at 16 bits, 13 bits and 3 bits, respectively, provid- 
ing more than an order of magnitude more room for 
growth in both directory count and directory size than 
appears to be necessary from the characteristics of our 
file server. However, shifting to a 64-bit inode locator 
would overcome any realistic limitations. 
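
As an illustration of the 16/13/3-bit split, the following sketch packs and unpacks a 32-bit inode locator. The field widths come from the text; the ordering of the fields within the word and the function names are assumptions made for the example, not a description of the on-disk format.

```python
DIR_BITS, SECTOR_BITS, ID_BITS = 16, 13, 3   # field widths from the text


def pack_locator(directory, sector, ident):
    """Pack (directory number, sector number, identifier) into one 32-bit value.

    The field order (directory in the high bits) is an assumption.
    """
    if not 0 <= directory < (1 << DIR_BITS):
        raise ValueError("directory number out of range")
    if not 0 <= sector < (1 << SECTOR_BITS):
        raise ValueError("sector number out of range")
    if not 0 <= ident < (1 << ID_BITS):
        raise ValueError("identifier out of range")
    return (directory << (SECTOR_BITS + ID_BITS)) | (sector << ID_BITS) | ident


def unpack_locator(locator):
    """Split a packed locator back into its three components."""
    ident = locator & ((1 << ID_BITS) - 1)
    sector = (locator >> ID_BITS) & ((1 << SECTOR_BITS) - 1)
    directory = locator >> (SECTOR_BITS + ID_BITS)
    return directory, sector, ident


if __name__ == "__main__":
    loc = pack_locator(directory=2, sector=5, ident=3)
    print(hex(loc), unpack_locator(loc))
```

With these widths a locator can name 65,536 directories, 8,192 sectors per directory (4 MB of directory data at 512 bytes per sector), and 8 identifiers per sector.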

To find the inode for an arbitrary inode locator, C-FFS 
uses the following procedure: 


1. Look in the inode cache. 


2. If the inode is not in the cache, use the directory 
number portion of the inode locator to index into 
the directory map. This provides the inode locator 
for the directory. A special directory number, 0, 
is used to refer to the “directory” containing multi- 
linked inodes that are not embedded (see below). 


3. Get the directory’s inode. This may require recur- 
sively executing steps 1, 2, and 3 several times. 
However, the recursion is always bounded by the 
root of the directory hierarchy and should rarely 
occur in practice, since getting to the file’s inode 
locator in the first place required looking in the di- 
rectory. Further, this recursion could be eliminated 
by exploiting the auxiliary physical location infor- 
mation included in the directory map (see below). 


4. Read the directory block that contains the sector 
referred to by the sector number component of the 
inode locator. (Note that the sector number is not
really necessary — it simply allows us to restrict 
the linear scan.) 


5. Traverse the set of directory entries in the sector to 
find the matching inode information, if it exists. 
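
The five steps above can be sketched as follows. This is an in-memory illustration under several assumptions, not the C-FFS code: locators are plain (directory number, sector number, identifier) tuples, each directory is a list of sectors that map identifiers to inodes, the root directory's inode is assumed to be held in memory so that the recursion terminates, and all class and attribute names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    name: str
    # For directories only: a list of sectors, each mapping identifier -> Inode.
    sectors: list = field(default_factory=list)


class FileSystem:
    def __init__(self, root, root_locator):
        self.root = root                  # assumed to be pinned in memory
        self.root_locator = root_locator
        self.inode_cache = {}             # inode locator -> Inode
        self.directory_map = {}           # directory number -> locator of its inode
        self.sector_reads = 0             # counts simulated directory reads

    def get_inode(self, locator):
        # Step 1: look in the inode cache.
        if locator in self.inode_cache:
            return self.inode_cache[locator]
        dir_no, sector_no, ident = locator
        # Step 2: index the directory map with the directory number.
        dir_locator = self.directory_map[dir_no]
        # Step 3: get the directory's inode; the recursion stops at the root.
        if dir_locator == self.root_locator:
            directory = self.root
        else:
            directory = self.get_inode(dir_locator)
        if directory is None:
            return None
        # Step 4: read the directory block holding the named sector.
        sector = directory.sectors[sector_no]
        self.sector_reads += 1
        # Step 5: scan the sector's entries for the matching identifier.
        inode = sector.get(ident)
        if inode is not None:
            self.inode_cache[locator] = inode
        return inode


def build_example():
    """Root (directory 1) contains 'src' (directory 2), which contains 'main.c'."""
    root, src, main_c = Inode("/"), Inode("src"), Inode("main.c")
    root.sectors = [{1: src}]
    src.sectors = [{2: main_c}]
    fs = FileSystem(root, root_locator=(1, 0, 0))
    fs.directory_map = {1: (1, 0, 0), 2: (1, 0, 1)}
    return fs


if __name__ == "__main__":
    fs = build_example()
    print(fs.get_inode((2, 0, 2)).name, fs.sector_reads)
```

In the example, the first lookup of the file reads one sector of the root and one sector of the subdirectory; a repeated lookup is satisfied from the inode cache and reads nothing.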


Moving files 


Because C-FFS stores inodes inside the directories that 
name them, moving a file from one directory to another 
involves also moving the inode. Fortunately, this is easy. 
However, because inode locators encode the inode’s lo- 
cation, including the containing directory’s number, the 
inode locator must be changed when an inode moves.³
In addition to file movement, C-FFS moves inodes when 
they are named by multiple directories (see below) and 
when the last name for a file is removed while it is 
still open (to support POSIX semantics). Unfortunately, 
changing the externally visible file ID can cause prob- 
lems for system components, such as an NFS server, 
that rely on constancy of the file ID for a given file.
Some solutions to this problem that we have considered 
include not actually moving the inode (and using the 
external link entry type discussed below), keeping an 
additional structure to correlate the old and new inode 
locators and forcing other system components to deal 
with the change. We currently use this last solution, 
but are growing concerned that too many applications
expect file IDs to remain constant. 

We currently believe that the correct solution is to 
bring back constant inode numbers and to introduce an- 
other table to translate inode numbers to inode locators. 
Properly implemented, we believe that such a table could 
be maintained with low cost. In particular, it would only 
need to be read when a desired inode is not in the inode 
cache. Also, by keeping both the inode number and the
inode locator in each inode, the table could be considered
soft state that is reconstructed if the system crashes.


³ Note that moving a directory will change the inode locator for the
directory but will not change the inode locators for any of the files it
names. The inode locators for these latter files encode the directory’s
identity by its index into the directory map rather than by its inode
locator.
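
The proposed translation table can be pictured with a short sketch. It assumes, hypothetically, that each inode carries a constant `number` and a changeable `locator`; the names and the dictionary representation are illustrative only and do not come from the C-FFS prototype.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Inode:
    number: int    # constant, externally visible inode number
    locator: int   # current inode locator; changes when the inode moves


def rebuild_table(inodes: Iterable[Inode]) -> Dict[int, int]:
    """Reconstruct the number-to-locator table by scanning every inode."""
    return {ino.number: ino.locator for ino in inodes}


def move_inode(table: Dict[int, int], ino: Inode, new_locator: int) -> None:
    """Moving an inode changes its locator but never its number."""
    ino.locator = new_locator
    table[ino.number] = new_locator
```

Because every inode records both values, losing the table in a crash loses no information: a scan of the inodes yields the same mapping again.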


Supporting hard links 


Although it is rare to have multiple hard links to a sin- 
gle file, it can be useful and we want to continue to 
allow it. Therefore, C-FFS replaces the inode pointer 
field of conventional directories with a type that can 
take on one of five values: (1) invalid, which indicates 
that the directory entry is just a space holder; (2) em- 
bedded, which indicates that the inode itself is in the 
entry; (3) directory pointer, which indicates that the entry
contains a directory number that can be used to index
into the directory map to find the corresponding inode 
locator (this entry type is used only for the special ‘.’ 
and ‘..’ entries); (4) internal link, which indicates that 
the entry contains a pointer to the location of the actual 
inode elsewhere within the directory; and (5) external 
link, which says that the entry contains a pointer to the 
location of the inode outside of the directory. Additional 
space in the directory entry is used to hold the inode or 
pointer for the latter four cases. 
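
One way to picture the five entry types is as a tagged value with a small resolver. This is a sketch under assumptions: entries are modeled as `(type, payload)` tuples, and the payload encodings (an index within the directory for internal links, an index into a list of externalized inodes for external links) are hypothetical.

```python
from enum import Enum


class EntryType(Enum):
    INVALID = 1
    EMBEDDED = 2
    DIRECTORY_POINTER = 3
    INTERNAL_LINK = 4
    EXTERNAL_LINK = 5


def resolve(entry, directory, directory_map, external_inodes):
    """Return ("inode", inode) or ("locator", locator); None for a space holder."""
    kind, payload = entry
    if kind is EntryType.INVALID:
        return None
    if kind is EntryType.EMBEDDED:
        return ("inode", payload)                  # the inode is in the entry itself
    if kind is EntryType.DIRECTORY_POINTER:
        return ("locator", directory_map[payload])  # payload is a directory number
    if kind is EntryType.INTERNAL_LINK:
        # payload is the index of the entry in this directory that embeds the inode
        return resolve(directory[payload], directory, directory_map, external_inodes)
    if kind is EntryType.EXTERNAL_LINK:
        return ("inode", external_inodes[payload])  # inode stored outside the directory
    raise ValueError(kind)
```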

As suggested by the existence of an external link type, 
C-FFS stores some inodes outside of directories. In par- 
ticular, C-FFS does this for inodes that are named by 
multiple files and for inodes that are not named but have 
been opened by some process. For the latter case, externalizing
an inode is simple because we don’t need to
guarantee its existence across a system failure. Externalizing
an inode to add a second external link, on the other
hand, is expensive in our current C-FFS implementation,
requiring two synchronous writes (one to copy the inode
to its new home and one to update the directory from
which it was moved). Externalized inodes are kept in a
dynamically-growable, file-like structure that is similar
to the IFILE in BSD-LFS [Seltzer93]. Some differences
are that the externalized inode structure grows as needed 
but does not shrink and its blocks do not move once they 
have been allocated. 


File system recovery 


One concern that re-arranging file system metadata
raises is that of integrity in the face of system failures
and media corruption. Regarding the first, we
have had no difficulties constructing an off-line file system
recovery program much like the UNIX FSCK utility
[McKusick94]. Although inodes are no longer at statically
determined locations, they can all be found (assuming
no media corruption) by following the directory
hierarchy. We also do not believe that embedded inodes 
increase the time required to complete failure recovery, 
since reading the directories is required for checking link
counts anyway. In fact, embedded inodes reduce the effort
involved with verifying the link counts, since extra
state need not be kept (a valid embedded inode has a link 
count of one plus the number of internal links). 
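
The link-count rule for embedded inodes can be checked with a single pass over a directory. The sketch below is illustrative only; it models entries as dictionaries with hypothetical `type`, `target` and `nlink` fields rather than the on-disk format.

```python
def expected_link_counts(entries):
    """Map each embedded entry's index to 1 + the number of internal links to it."""
    counts = {i: 1 for i, e in enumerate(entries) if e["type"] == "embedded"}
    for e in entries:
        if e["type"] == "internal_link" and e["target"] in counts:
            counts[e["target"]] += 1
    return counts


def bad_link_counts(entries):
    """Return the indices of embedded inodes whose stored count is wrong."""
    expected = expected_link_counts(entries)
    return [i for i, n in expected.items() if entries[i]["nlink"] != n]
```

No state has to be carried from one directory to the next, which is the saving the text describes.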

The one potential problem that embedding inodes in- 
troduces is that an unfortunate media corruption can 
cause all files below a corrupted directory to be lost. 
Even though they are uncorrupted, the single way to 
find them will have been destroyed. This is unaccept- 
able when compared to the current scheme, which loses 
a maximum of 64 files or directories when an inode 
block is destroyed. All files that become disconnected 
from the name hierarchy due to such a loss can still be 
found. Fortunately, we can employ a simple solution to 
this problem — redundancy. By augmenting the direc- 
tory map with the physical location of each directory’s 
inode, we eliminate the loss of directories and files be- 
low a lost directory. Although the absolute amount of 
loss may be slightly higher (because the density of use- 
ful information is higher), it should be acceptable given 
the infrequency of post-factory media corruption. 


Simplifying integrity maintenance 


Although the original goal of embedded inodes was to 
reduce the number of separate disk requests, a pleasant 
side effect is that we can also eliminate one of the se- 
quencing constraints associated with metadata updates 
[Ganger94]. In particular, by eliminating the physical 
separation between a name and the corresponding inode, 
C-FFS exploits a disk drive characteristic to atomically 
update the pair. Most disks employ powerful error cor- 
recting codes on each sector, which has the effect of elim- 
inating (with absurdly high probability) the possibility of 
only part of the sector being updated. So, by keeping the 
two items in the same sector, we can guarantee that they 
will be consistent with respect to each other. For file 
systems that use synchronous writes to ensure proper 
sequencing, this can result in a two-fold performance 
improvement [Ganger94]. For more aggressive implementations
(e.g., [Hagmann87, Chutani92, Ganger95]),
this reduces complexity and the amount of book-keeping 
required. 


Directory sizes 


A potential down-side of embedded inodes is that the 
directory size can increase substantially. While making 
certain that an inode and its name remain in the same 
sector, three directory entries with embedded 128-byte 
inodes can be placed in each 512-byte sector. Fortu- 
nately, as shown in Figure 4, the number of directory 
entries is generally small. For example, 94% of all directories
would require only one 8 KB block on our file
servers. For many of these, embedded inodes actually
fit into space that was allocated anyway (the minimum 
unit of allocation is a 1 KB fragment). Still, there are a 
few directories with many entries (e.g., several of over 
1000 and one of 9000). For a large directory, embedded 
inodes could greatly increase the amount of data that 
must be read from disk in order to scan just the names 
in a directory (e.g., when adding yet another directory 
entry). We are not particularly concerned about this,
since bandwidth is what disks are good at. However, if 
experience teaches us that large directories are a signif- 
icant problem, one option is to use the external inode 
“file” for the inodes of such directories. 
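
The sizes quoted above can be turned into a quick estimate. The sketch assumes, for illustration only, that every sector is fully packed with three entries; the function names are hypothetical.

```python
import math

SECTOR_BYTES = 512
INODE_BYTES = 128
BLOCK_BYTES = 8 * 1024
ENTRIES_PER_SECTOR = 3   # from the text


def bytes_left_for_names():
    """Space per sector left for names and entry headers after three inodes."""
    return SECTOR_BYTES - ENTRIES_PER_SECTOR * INODE_BYTES


def entries_per_block():
    return ENTRIES_PER_SECTOR * (BLOCK_BYTES // SECTOR_BYTES)


def blocks_needed(n_entries):
    """Minimum number of 8 KB blocks for a fully packed directory."""
    return math.ceil(n_entries / entries_per_block())
```

Under these assumptions an 8 KB block holds 48 entries, a 1000-entry directory needs at least 21 blocks, and a 9000-entry directory needs at least 188 blocks (roughly 1.5 MB).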


3.2 Grouping Small Files


Like embedded inodes, small file grouping is concep- 
tually quite simple. C-FFS simply places several small 
files adjacently on the disk and reads/writes the entire
group as a unit. The three main issues that arise are 
identifying the disk locations that make up a group, al- 
locating disk locations appropriately, and caching the 
additional group items before they have been requested 
or identified. We discuss each of these below. 


Identifying the group boundaries 


To better exploit available disk bandwidth, C-FFS moves 
groups of blocks to and from the disk at once rather than 
individually. To do so, it must be possible to determine, 
for any given block, whether it is part of a group and, 
if so, what other disk locations are in the same group. 
C-FFS does this by augmenting each inode with two 
fields identifying the start and length of the group.* Be- 
cause grouping is targeted for small files, we don’t see 
any reason to allow an inode to point to data in more than 
one group. In fact, C-FFS currently allows only the first 
block (the only block for 79% of observed files) of any 
file other than a directory to be part of a group. Identi- 
fying the boundaries of the group is a simple matter of 
looking in the inode, which must already be present in 
order to identify the disk location of the desired block. 
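
A sketch of the group lookup for a non-directory file follows. The field names are hypothetical, and a zero length is used here (as an assumption) to mean that the file is not grouped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Inode:
    first_block: int   # disk address of the file's first block
    group_start: int   # disk address at which the group begins
    group_len: int     # number of blocks in the group; 0 means "not grouped"


def group_extent(inode: Inode, block: int) -> Optional[range]:
    """Return the disk addresses to transfer together with `block`, if any."""
    if inode.group_len == 0 or block != inode.first_block:
        return None    # only the first block of such a file may be in a group
    extent = range(inode.group_start, inode.group_start + inode.group_len)
    return extent if block in extent else None
```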


Allocation for groups 


Before describing what is different about the allocation 
routines used to group small files, we want to stress 
what is not different. Placement of data for large files 
remains unchanged and should exploit clustering tech- 
nology [Peacock88, McVoy91]. Directories are also 
placed as in current systems, by finding a cylinder group
with many free blocks. (The fast file system actually
performs directory allocation based on the number of
free inodes rather than the amount of free space. With
embedded inodes, however, this becomes both difficult
and of questionable value.)


* Although we have added fields to support grouping in the C-FFS
prototype, it would seem reasonable to overload two of the indirect
block pointers instead and simply disable grouping for large files.

The main change occurs in deciding where to place 
the first block (or fragment) of a file. The standard 
approach is to scan the freelist for a free block in the 
cylinder group that contains the inode, always starting 
from the beginning. As files come and go, the free re- 
gions towards the front of the cylinder group become 
fragmented. C-FFS, on the other hand, tries to incor- 
porate the new file block (or fragment) into an existing 
group associated with the directory that names the file. 
This succeeds if there is free space within one of the 
groups (perhaps due to a previous file deletion or trun- 
cation) or if there is a free block outside of a group that 
can be incorporated without causing the group to ex- 
ceed its maximum size (currently hard-coded to 64 KB 
in our prototype, which approximates the knees of the 
curves in Figure 2). To identify the groups associated 
with a particular directory, C-FFS exploits the fact that 
using embedded inodes gives us direct access to all of 
the relevant inodes (and thus their group information) 
by examining the directory’s inode (which identifies the 
boundaries of the first group) and scanning the directory 
(ignoring any inodes for directories). If the new block 
extends an existing group, C-FFS again exploits embed- 
ded inodes to scan the directory and update the group 
information for other group members. If the new block 
cannot be added to an existing group, the conventional 
approach is used to allocate the block and a new group 
with one member is created. 
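
The allocation policy for a new file's first block can be sketched as three attempts in order. This is a simplification under stated assumptions, not the prototype's allocator: groups are `(start, length)` extents, a group grows only by a block directly adjacent to it, the fallback is a lowest-address scan, and updating the group fields of the other member inodes is omitted. The limit of 16 blocks corresponds to the 64 KB maximum with 4 KB blocks.

```python
MAX_GROUP_BLOCKS = 16  # 64 KB maximum group size / 4 KB blocks


def allocate_first_block(groups, free_blocks):
    """Pick a block for a new file's first block.

    groups: list of (start, length) extents for the naming directory's groups.
    free_blocks: set of free block addresses; the chosen block is removed.
    Returns (block, index of the group that now contains it).
    """
    if not free_blocks:
        raise RuntimeError("no free blocks")
    # 1. Reuse a free block that already lies inside one of the groups.
    for idx, (start, length) in enumerate(groups):
        for block in range(start, start + length):
            if block in free_blocks:
                free_blocks.remove(block)
                return block, idx
    # 2. Grow a group by one adjacent free block if it stays within the limit.
    for idx, (start, length) in enumerate(groups):
        if length >= MAX_GROUP_BLOCKS:
            continue
        if start + length in free_blocks:
            block = start + length
            groups[idx] = (start, length + 1)
        elif start - 1 in free_blocks:
            block = start - 1
            groups[idx] = (block, length + 1)
        else:
            continue
        free_blocks.remove(block)
        return block, idx
    # 3. Fall back to a conventional scan and start a new one-member group.
    block = min(free_blocks)
    free_blocks.remove(block)
    groups.append((block, 1))
    return block, len(groups) - 1
```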

One point worth stressing about explicit grouping is 
that it does not require all blocks within the boundaries 
of a group to belong to inodes in a particular directory. 
Although C-FFS tries to achieve this ideal, arbitrary files
can allocate blocks that end up falling within the bound- 
aries of an unrelated group. The decision to allow this 
greatly simplifies the management of group information 
(i.e., it is simply a hint that describes blocks that are 
hopefully related). Although this could result in sub- 
optimal behavior, we believe that it is appropriate given 
the fact that reading/writing a few extra disk blocks has 
a small incremental cost and given the complexity of 
alternative approaches. 

Multi-block directories are the exception to both the 
allocation routine described above and to the rule that 
only the first block of a file can be included in a group. 
Because scanning the contents of a directory is a com- 
mon operation, C-FFS tries to ensure that all of its blocks 
are part of the first group associated with the directory 
(i.e., the one identified by the directory’s inode). Fortunately,
even with embedded inodes, most directories
(e.g., 94 percent for our file servers) will be less than
one block in size. If a subsequent directory block must 
be allocated and cannot be incorporated into the directory’s
first group, C-FFS selects one of the non-directory
blocks in the first group, moves it (i.e., copies it to a new 
disk location and changes the pointer in the inode), and 
gives its location to the new directory block. Once again, 
this operation exploits embedded inodes to find a mem- 
ber of the first group from which to steal a block. Unfor- 
tunately, moving the group member’s block requires two 
synchronous writes (one to write the new block and one 
to update the corresponding inode). A better metadata 
integrity maintenance mechanism (e.g., write-ahead log- 
ging or soft updates) could eliminate these synchronous 
writes. When no blocks are available for stealing, C-FFS
falls back to allocating a block via the conventional ap- 
proach (with a preference for a location immediately 
after the first group). 


Cache functionality required 


Grouping requires that the file block cache provide a 
few capabilities, most of which are not new. In par- 
ticular, C-FFS requires the ability to cache and search 
for blocks whose higher level identities (i.e., file ID 
and offset) are not known. Therefore, our file cache 
is indexed by both disk address⁵, like the original
UNIX buffer cache, and higher-level identities, like the 
SunOS integrated caching and virtual memory system 
[Gingell87, Moran87]. C-FFS uses physical identities to 
insert newly-read blocks of a group into the cache with- 
out back-translating to discover their file/offset identi- 
ties. Instead, C-FFS inserts these blocks into the cache 
based on physical disk address and an invalid file/offset 
identity. When a cache miss occurs for a file/offset that 
is a member of any group, C-FFS searches the cache 
a second time, by physical disk address (since it might 
have been brought in by a previous grouped access). If 
the block is present under this identity, C-FFS changes 
the file/offset identity to its proper value and returns the
block without an additional disk read. 

The ability to search the cache by physical disk ad- 
dress is also necessary when initiating a disk request for 
an entire group. When reading a group, C-FFS prunes 
the extent read based on which blocks are already in the 
cache (it would be disastrous to replace a dirty block 
with one read from the disk). When writing a group, 
C-FFS prunes the extent written to include only those 
blocks that are actually present in the cache and dirty. 
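
The cache behavior described in this subsection can be sketched with two indexes over the same block records. The class and method names are hypothetical, and the sketch returns the individual addresses still needed rather than building the (possibly split) disk requests a real implementation would issue.

```python
class BlockCache:
    def __init__(self):
        self.by_addr = {}   # physical disk address -> block record
        self.by_file = {}   # (file_id, offset) -> physical disk address

    def insert_group_block(self, addr, data):
        """Insert a block read as part of a group; its file identity is unknown."""
        if addr not in self.by_addr:    # never replace a cached (maybe dirty) block
            self.by_addr[addr] = {"data": data, "dirty": False, "ident": None}

    def lookup(self, file_id, offset, addr):
        """Look up by file/offset first, then by physical disk address."""
        key = (file_id, offset)
        if key in self.by_file:
            return self.by_addr[self.by_file[key]]
        block = self.by_addr.get(addr)
        if block is not None:           # brought in earlier by a grouped access
            block["ident"] = key        # give it its proper identity
            self.by_file[key] = addr
        return block

    def blocks_to_read(self, start, length):
        """Group read: skip blocks that are already cached."""
        return [a for a in range(start, start + length) if a not in self.by_addr]

    def blocks_to_write(self, start, length):
        """Group write: only blocks that are cached and dirty."""
        return [a for a in range(start, start + length)
                if a in self.by_addr and self.by_addr[a]["dirty"]]
```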

Given that C-FFS requires the cache to support
lookups based on disk address, C-FFS uses this mechanism
(rather than clustering or grouping information)
to identify additional dirty blocks to write out with any
given block. With this scheme, the main benefit provided
by grouping is to increase the frequency with which
dirty blocks are physically adjacent on the disk. C-FFS
uses scatter/gather support to deal with non-contiguity
of blocks in the cache. Were scatter/gather support not
present, C-FFS would rely on more complicated extent-based
memory management to provide contiguous regions
of physical memory for reading and writing of
both small file groups and large file clusters.


⁵ By physical disk address, we mean the address given to the device
driver when initiating a disk request, as opposed to more detailed
information about the cylinder, surface and rotational offset at
which the data are stored inside the disk.


Capacity          1 GB
Cylinders         2700
Surfaces          9
Sectors/Track     84
Rotation Speed    5400 RPM
Head Switch       N/A
One-cyl Seek      1 ms
Average Seek      9/10.5 ms
Maximum Seek      22 ms

Table 2: Characteristics of the Seagate ST31200 drive.


4 Performance Evaluation 


This section reports measurements of our C-FFS im- 
plementation, which show that it can dramatically im- 
prove performance. For small file activity (both reads 
and writes), we observe order of magnitude reductions 
in the number of disk requests and factor of 5-7 im- 
provements in performance. We observe no negative 
effects for large file I/O. Preliminary measurements of 
application performance show improvements of 10-300 
percent. 


4.1 Experimental Apparatus 


All experiments were performed on a PC with a 
120 MHz Pentium processor and 32 MB of main memory.
The disk on the system is a Seagate ST31200 (see
Table 2). The disk driver, originally taken from NetBSD, 
supports scatter/gather I/O and uses a C-LOOK schedul- 
ing algorithm [Worthington94]. The disk prefetches se- 
quential disk data into its on-board cache. During the 
experiments, there was no other activity on the system, 
and no virtual memory paging occurred. In all of our 
experiments, we forcefully write back all dirty blocks 
before considering the measurement complete. Therefore,
our disk request counts include all blocks dirtied
by a particular application or micro-benchmark.

The disk drive used in the experiments is several years 
old and can deliver only one-third to one-half of the 
bandwidth available from a state-of-the-art disk (as seen 
in Table 1). As a result, we believe that the perfor- 
mance improvements shown for embedded inodes and 
explicit grouping are actually conservative estimates. 
These techniques depend on high bandwidth to deliver 
high performance for small files, while the conventional 
approach depends on disk access times (which have im- 
proved much less). 

The C-FFS implementation evaluated here is the default
file system for the Intel x86 version of the exokernel
operating system [Engler95]. It includes most of
the functionality expected from an FFS-like file system. 
Its major limitations are that it currently does not sup- 
port prefetching or fragments (the units of allocation are 
4 KB blocks). Prefetching is more relevant for large files 
than the small files our techniques address and should 
be independent of both embedded inodes and explicit 
grouping. Fragments, on the other hand, are very rel- 
evant and we expect that they would further increase 
the value of grouping, because the allocator would explicitly
attempt to allocate fragments within the group
rather than taking the first available fragment or trying to 
optimize for growth by allocating new fragment blocks. 

We compare our C-FFS implementation to itself with 
embedded inodes and explicit grouping disabled. Although
this may raise questions about how solid our
baseline is, we are comfortable that it is reasonable. 
Comparisons of our restricted C-FFS (without embed- 
ded inodes or explicit grouping) to OpenBSD’s FFS on 
the same hardware indicate that our baseline is actually 
as fast or faster for most file system operations (e.g., 
twice as fast for file creation and writing, equivalent 
for deletion and reading from disk, and much faster 
when the static file cache size of OpenBSD limits per- 
formance). The performance differences are partially 
due to the system structure, which links the file sys- 
tem code directly into the application (thereby avoiding 
many system calls), and partially due to the file sys- 
tem implementation (e.g., aggressively clustering dirty 
file blocks based on physical disk addresses rather than 
logical relationships). 

For our micro-benchmark experiments, we wanted to 
recreate the non-adjacency of on-disk placements ob- 
served on our file servers. Our simplistic approach was 
to modify the allocation routines. Rather than allocating 
a new file’s first block at the first free location within the 
cylinder group, we start looking for free space at other 
locations within the cylinder group. For half of the files 
in a directory, we start looking at the last direct block of
the previous entry. For the other half, we start looking 
at a random location within the cylinder group. The
resulting data layouts are somewhat better than those
observed on our file server, and therefore should favor 
the conventional layout scheme. We do not modify the 
inode allocation routines in any way, so inodes allocated 
in sequence for files in a given directory will tend to 
be packed into inode blocks much more densely than 
was observed on our server. Again, this favors the con- 
ventional organization. For the non-microbenchmark 
experiments, we do not do this. 


4.2 Small File Performance 


Figure 5 compares the throughput of 1 KB file operations
supported by our prototype with embedded inodes,
with explicit grouping, with neither and with both. The 
micro-benchmark, based on the small-file benchmark 
from [Rosenblum92], has four phases: create and write 
10000 1 KB files, read the same files in the same order,
overwrite the same files in the same order, and then
remove the files in the same order. To avoid the cost 
of name lookups in large directories, the files are spread 
among directories such that no single directory has more 
than 100 entries. In this comparison, both file systems 
are configured to use synchronous writes for metadata 
integrity maintenance (as is common among UNIX file 
systems). 
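
The four phases can be approximated at user level with a short script. This is a sketch of the workload shape only, not the harness used for the measurements: the function names are hypothetical, directory creation is left outside the timed phases, and nothing here forces dirty blocks back to disk, so on a system with a large file cache the timings mostly reflect memory speed.

```python
import os
import tempfile
import time


def small_file_benchmark(root, n_files=10000, size=1024, per_dir=100):
    """Run the four phases under `root` and return seconds per phase."""
    paths = [
        os.path.join(root, f"d{i // per_dir:04d}", f"f{i % per_dir:03d}")
        for i in range(n_files)
    ]
    for d in sorted({os.path.dirname(p) for p in paths}):
        os.makedirs(d, exist_ok=True)
    payload = b"x" * size
    timings = {}

    def timed(name, action):
        start = time.perf_counter()
        for p in paths:             # every phase visits files in the same order
            action(p)
        timings[name] = time.perf_counter() - start

    def write(p):
        with open(p, "wb") as f:
            f.write(payload)

    def read(p):
        with open(p, "rb") as f:
            f.read()

    timed("create", write)
    timed("read", read)
    timed("overwrite", write)
    timed("remove", os.remove)
    return timings


def run_in_temp_dir(**kwargs):
    """Convenience wrapper that runs the benchmark in a scratch directory."""
    with tempfile.TemporaryDirectory() as root:
        return small_file_benchmark(root, **kwargs)
```

Dividing `n_files` by each phase's elapsed time gives a files-per-second figure comparable in form, though not in meaning, to the throughput axis of Figure 5.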

For file creation, we observe a twentyfold reduction in 
the number of disk requests necessary when using both 
embedded inodes and explicit grouping. This results in 
a sevenfold increase in throughput. Half of the reduction 
in the number of disk requests comes from eliminating 
the synchronous writes required for integrity (by mak- 
ing the name and inode updates atomic) and half comes 
from writing the blocks for several new files with each 
disk request. It is interesting to note that half of the disk 
requests that remain are synchronous writes required be- 
cause additional directory blocks are being forced into 
the directory’s first group. This cost could be eliminated 
by a better integrity maintenance scheme (see below) or 
by not shuffling disk blocks in this way (which could 
involve a performance cost for later directory scanning 
operations). It is also interesting to observe that C-FFS
with both embedded inodes and explicit grouping significantly
outperforms C-FFS with either of these techniques
alone (5 times fewer disk requests and 3-4 times
the create/write throughput). 

For file read and overwrite, we observe an order of 
magnitude reduction in the number of disk requests when 
using explicit grouping. This increases throughput by 
factors of 4.5-5.5. For these operations, embedded in- 
odes alone provide marginal improvements. However, 
embedded inodes do provide measurable improvement 
when explicit grouping is in use, once again making the 
combination of the two the best option.

Figure 5: Small file throughput when using synchronous writes for metadata integrity.

Figure 6: Small file throughput when using soft updates for metadata integrity. The “remove” throughput values are
off the scale, ranging from 1500-2000 files per second.

For file deletion, we see a twofold decrease in the
number of disk requests when using embedded inodes, 
since embedding the inodes eliminates one of the two 
sequencing requirements for file deletion. The second 
sequencing requirement, which relates to the relation- 
ship between the inode and the free map, remains. The 
result is a 250% increase in file deletion throughput, 
provided both by the reduction in the number of disk 
requests and improved locality (i.e., the same block gets 
overwritten repeatedly as the multiple inodes that it con- 
tains are re-initialized). 

Because there exist techniques (e.g., soft updates 
[Ganger94]) that have been shown to effectively elimi- 
nate the performance cost of maintaining metadata in- 
tegrity, we repeat the same experiments with this cost 
removed. Figure 6 shows these results. We have not yet 
actually implemented soft updates in C-FFS, but rather 
emulate it by using delayed writes for all metadata up- 
dates — [Ganger94] shows that this will accurately pre- 
dict the performance impact of soft updates. The change 
in performance that we observe for the conventional file 
system is consistent with previous studies. 

With the synchronous metadata writes removed, we 
still observe an order of magnitude reduction in disk 
requests when using explicit grouping for create/write, 
read and overwrite operations. The result is throughput 
increases of 4-7 times. Although the performance bene- 
fit of embedded inodes is lower for the create/write phase 
(because synchronous writes are not an issue), there is 
still a gain because embedded inodes significantly reduce
the work involved with allocation for explicit grouping. 
As before, the combination of embedded inodes and ex- 
plicit grouping provides the highest throughput for both 
the read and overwrite phases. With synchronous writes 
eliminated, file deletion throughput increases substan- 
tially. Although it has no interesting effect on perfor- 
mance (because resulting file deletion throughput is so 
high), embedding inodes halves the number of blocks 
actually dirtied when removing the files because there 
are no separate inode blocks. 


4.3 File System Aging 


To get a handle on the impact of file system fragmen- 
tation on the performance of C-FFS, we use an aging 
program similar to that described in [Herrin93]. The 
program simply creates and deletes a large number of 
files. The probability that the next operation performed 
is a file creation (rather than a deletion) is taken from a 
distribution centered around a desired file system utilization.
After reaching the desired file system utilization for
the first time, the aging program executes some number 
of additional file operations taken from the same distribution.
The size of each file created is taken from the
distribution measured on our file servers.

Figure 7: Small file throughput with embedded inodes
and explicit grouping after aging the file system. Before
running the small file micro-benchmark (without soft
updates), the file system was aged by filling the file system
to a desired capacity and then executing 100000 file
create and delete operations. To increase the impact of
the aging, we performed these experiments on a 128 MB
partition. The four bars for each phase represent C-FFS
performance for a fresh file system, for a 50%-full aged
file system, for a 70%-full aged file system and for a
90%-full aged file system, respectively.

Figure 7 shows performance for the small file micro- 
benchmark after the file system has been aged. As ex- 
pected, aging does have a significant negative impact 
on performance. At 70% capacity, the throughputs for 
the first three phases decrease by 30-40%. However, 
comparing these throughputs to those reported in the 
previous section, C-FFS still outperforms the conventional
file system by a factor of 3-4. Further, our current
C-FFS allocation algorithms do not reduce or compen- 
sate for fragmentation of free extents. We expect that the 
degradation due to file system aging can be significantly 
reduced by better allocation algorithms. 
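
The aging procedure used in this section can be imitated with a small simulation. The sketch tracks only block counts, not disk layout, and several details are assumptions made for illustration: the create probability is a clamped linear function of the gap between target and current utilization (controlled by a hypothetical `gain` parameter), deletions pick a live file uniformly at random, and file sizes are drawn from a caller-supplied list rather than a measured distribution.

```python
import random


def age_file_system(capacity, target, extra_ops, sizes, seed=0, gain=5.0):
    """Simulate aging; return (final utilization, number of live files).

    capacity: total blocks; target: desired utilization in [0, 1);
    extra_ops: operations to run after the target is first reached;
    sizes: candidate file sizes in blocks.
    """
    rng = random.Random(seed)
    files = []          # sizes (in blocks) of the live files
    used = 0
    reached = False
    remaining = extra_ops
    while remaining > 0:
        util = used / capacity
        if not reached and util >= target:
            reached = True
        # Create probability is 0.5 at the target and higher below it.
        p_create = min(1.0, max(0.0, 0.5 + gain * (target - util)))
        if rng.random() < p_create:
            size = rng.choice(sizes)
            if used + size <= capacity:
                files.append(size)
                used += size
        elif files:
            victim = rng.randrange(len(files))
            files[victim], files[-1] = files[-1], files[victim]
            used -= files.pop()
        if reached:
            remaining -= 1
    return used / capacity, len(files)
```

For example, `age_file_system(10000, 0.7, 20000, [1, 1, 2, 4])` fills a 10000-block volume to about 70% and then runs 20000 further create and delete operations around that level.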


4.4 Large File Performance 


Although C-FFS focuses on improving performance for 
small files, it is essential that it not reduce the perfor- 
mance that can be realized for large files. To verify 
that this is the case, we use a standard large file micro-benchmark,
which allocates and writes a large (32 MB)
file sequentially, reads it sequentially and overwrites it
sequentially. The results (shown in Figure 8) allow us
to make two important points. First, using embedded
inodes and explicit grouping has no significant effect
on large file performance. Second, the C-FFS prototype,
which supports clustering of large file data, delivers
most of the disk’s available bandwidth to applications for
large files. (We believe that the somewhat disappointing
file write bandwidth values are caused by software
inefficiency; we have verified that it is not due to poor
clustering.)

Figure 8: Large file bandwidth.


4.5 Applications 


Figure 9 shows performance for four different applica- 
tions which are intended to approximate some of the ac- 
tivities common to software development environments. 
As expected, we see significant improvements (e.g., a 
50-66% reduction in execution time) for file-intensive 
applications like “pax” and “rm”. For gmake, we see 
a much smaller improvement of only 10%. While such 
a small improvement could be viewed as a negative re- 
sult, we were actually quite happy with it because of 
the extremely untuned nature of certain aspects of our 
system (which has just barely reached the point, at the 
time of this writing, where such applications can be run 
at all). In particular, process creation and shutdown 
(which are used extensively by the “gmake” application)
are currently expensive. As a result, the time required
for “gmake” on our exokernel system is currently
250% greater than the corresponding execution time
on OpenBSD. The same absolute improvement given
a baseline equal to that of OpenBSD would represent
a 25% reduction in execution time for this
compute-intensive task.

Figure 9: Application performance. “pax” uses the PAX
utility to unpack a compressed archive file containing
a substantial portion of the source tree for the default
exokernel library operating system (about 1000 C files).
“rm” uses the UNIX RM utility to recursively remove
the directory tree created by “pax”. “gmake” compiles
a sub-directory containing 32 C files. “clean” removes
the newly created object files from “gmake”.

We believe that, for most applications, the name space 
provides useful information about relationships between 
files that can be exploited by explicit grouping. How- 
ever, there are some applications (e.g., the web page 
caches of many HTTP browsers) that explicitly random- 
ize file accesses across several directories in order to 
reduce the number of files per directory. For workloads 
where the name space is a poor indicator of access lo- 
cality, we expect grouping to reduce read performance 
slightly, because it reads from the disk more data than 
are necessary. 


5 Related Work 


Previous researchers and file system implementors have 
been very successful at extracting large fractions of
disks’ maximum bandwidths for large files. One simple
approach, which does have some drawbacks, is to
increase the size of the basic file block [McKusick84]. 
Another is to cluster blocks (i.e., allocate them con- 
tiguously and read/write them as a unit when appropri- 
ate) [Peacock88, McVoy91]. Enumeration of a large 
file’s disk blocks can also be significantly improved by 
using extents (instead of per-block pointers), B+ trees 
[Sweeney96] and/or sparse bitmaps [Herrin93]. We 
build on previous work by trying to exploit disk band- 
width for small files and metadata. 

To increase the efficiency of the many small disk 
requests that characterize accesses to small files and 
metadata, file systems often try to localize logically 
related objects. For example, the Fast File System 
[McKusick84] breaks the file system’s disk storage 
into cylinder groups and attempts to allocate most 
new objects in the same cylinder group as related 
objects (e.g., inodes in same cylinder group as con- 
taining directory and data blocks in same cylinder 
group as owning inode). Similarly, several researchers 
have investigated the value of moving the most pop- 
ular (i.e., most heavily used) data to the centermost 
disk cylinders in order to reduce disk seek distances 
[Vongsathorn90, Ruemmler91, Staelin91]. As described 
in Section 2, simply locating related objects near each 
other offers some performance gains, but such locality 
affects only the seek time and is thus limited in scope. 
Co-locating related objects and reading/writing them as 
a unit offers qualitatively larger improvements in perfor- 
mance. 

Immediate files are a form of co-location for small 
files. The idea, as proposed by [Mullender84], is to ex- 
pand the inode to the size of a block and include the first 
part of the file in it. They found that over 60% of their 
files could be kept with the inode in a single block (and
therefore both could be read or written with a single disk 
request). Another way to look at the same idea is simply 
that the inode is moved to the first block of the file. One 
potential down-side to this approach is that it replaces 
co-location of possibly related inodes (in an inode block) 
with co-location of an inode and its file data, which can 
reduce the performance of operations that examine file 
attributes but not file data. One simple application of 
the basic idea, however, is to put the data for very small
files in the “normally” sized inode, perhaps replacing 
the space used by block pointers. 

The log-structured file system’s answer to the disk 
performance problem is to delay, remap and cluster 
all new data, only writing large chunks to the disk 
[Rosenblum92]. So long as neither cleaning nor read
traffic represents a significant portion of the workload, this
will offer the highest performance. Unfortunately, while
it may be feasible to limit cleaning activity to idle periods
[Blackwell95], anecdotal evidence, measurements
of real systems (e.g., [Baker91]), and simulation studies 
(e.g., [Dahlin94]) all suggest that main memory caches 
have not eliminated read traffic as hoped. Our work at- 
tempts to achieve performance improvements for both 
reads and writes of small files and metadata. While we 
are working within the context of a conventional update- 
in-place file system, we could easily see co-location 
being used in a log-structured file system to improve 
performance when reads become necessary. 

One of the extra advantages of embedded inodes 
is the elimination of one sequencing constraint when 
creating and deleting files. There are several more 
direct and more comprehensive approaches to reduc- 
ing the performance cost of maintaining metadata in- 
tegrity, including write-ahead logging [Hagmann87, 
Chutani92, Journal92, Sweeney96], shadow-paging 
[Chamberlin81, Stonebraker87, Chao92, Seltzer93] and
soft updates [Ganger95]. As shown in Section 4, our 
work complements such approaches. 

Of course, there is a variety of other work that has 
improved file system performance via better caching, 
prefetching, write-back, indexing, scheduling and disk 
array mechanisms. We view our work as complementary 
to these. 


6 Discussion 


The C-FFS implementation described and evaluated in 
this paper is part of the experimental exokernel OS
[Engler95]. We have found the exokernel to be an ex- 
cellent platform for systems research of this kind. In 
particular, our first proof-of-concept prototype was ex- 
tremely easy to build and test, because it did not have 
to deal with complex OS internals and it did not have 
to pay high overheads for being outside of the operating 
system. 

However, the system-related design challenges for file 
systems in an exokernel OS (which focuses on distributing
control of resources among applications) are different
from those in a more conventional OS. An implementation
of C-FFS for OpenBSD is underway, both to
allow us to better understand how it interacts with an 
existing FFS and to allow us to more easily transfer it to 
existing systems. Early experience suggests that C-FFS 
changes some of the locking and buffer management as- 
sumptions made in OpenBSD, but there seem to be no 
fundamental roadblocks. 

In this paper, we have compared C-FFS to an FFS-like
file system, both with and without additional support for 
eliminating the cost of metadata integrity maintenance. 
Although we have not yet performed measurements, it is 
also interesting to compare C-FFS to other file systems
(in particular, the log-structured file system). We believe
that C-FFS can match the write performance of LFS, 
so long as the name space correctly indicates logical 
relationships between files. For read performance, the 
comparison is more interesting. C-FFS will perform best 
when read access patterns correspond to relationships in 
the name space. LFS will perform best when read access 
patterns exactly match write access patterns. Although 
experimentation is needed, we believe that C-FFS is likely to
outperform LFS in many cases, especially when multiple 
applications are active concurrently. 

In this paper, we investigate co-locating files based on 
the name space; other approaches based on application- 
specific knowledge are worth investigating. For exam- 
ple, one application-specific approach is to group files 
that make up a single hypertext document [Kaashoek96].
We are investigating extensions to the file system inter- 
face to allow this information to be passed to the file 
system. The result will be a file system that groups files 
based on application hints when they are available and 
name space relationships when they are not. 

Our experience with allocation for small file grouping 
is preliminary, and there are a variety of open questions. 
For example, although the current C-FFS implementa- 
tion allows only one block from a file to belong to a 
group, we suspect that performance will be enhanced 
and fragmentation will be reduced by allowing more 
than one. Also, C-FFS currently allows a group to be 
extended to include a new block even if it must also in- 
clude an unrelated block that had (by misfortune) been 
allocated at the current group’s boundary. We believe, 
based on our underlying assumption that reading an extra 
block incurs a small incremental cost, that this choice is 
appropriate. However, measurements that demonstrate 
this and indicate a maximum size for such holes are 
needed. Finally, allocation algorithms that reduce and 
compensate for the fragmentation of free space caused 
by aging will improve performance significantly (by
40-60%, according to the measurements in Section 4.3).


7 Conclusions 


C-FFS combines embedded inodes and explicit group- 
ing of files with traditional FFS techniques to obtain high 
performance for both small and large file I/O (both reads 
and writes). Measurements of our C-FFS implementa- 
tion show that the new techniques reduce the number 
of disk requests by an order of magnitude for standard 
small file activity benchmarks. For the system under 
test, this translates into performance improvements of 
a factor of 5-7. The new techniques have no negative 
impact on large file I/O; the FFS clustering still delivers
maximal performance. Preliminary experiments
with real applications show performance improvements
of 10-300 percent.


Abstract


Current generations of hard disk drives use a tech- 
nique known as zoned constant angular veloc- 
ity (ZCAV), taking advantage of the geometry to 
increase total disk capacity by varying the number 
of disk sectors per track with the distance from the 
spindle. A side effect of this is that the transfer rate 
also varies with sector address. We analytically esti- 
mated and measured this effect on file system perfor- 
mance on a BSD Fast File System, showing a drop of 
roughly 25% in peak transfer rate depending on head 
position. We also show that, while ZCAV effects 
cannot be ignored, a simple linear model adequately 
estimates the performance from the few parameters 
normally available in disk drive spec sheets. 


1 Introduction 


Many magnetic disk drives use a technique known as 
zoned constant angular velocity (ZCAV), taking ad- 
vantage of the geometry to increase total disk capac- 
ity by varying the number of disk sectors per track 
with the distance from the spindle. A side effect of 
this is that the transfer rate also varies with block 
address. 

Despite some excellent recent work on model- 
ing the behavior of disk drives [10, 14], the effects 
of ZCAV have generally not been taken into ac- 
count in the design of file systems. Worthington 
et al. [13] built a disk model that includes zone
information, but the emphasis of their work is on 
disk scheduling algorithms to reduce latency, rather 
than improve throughput. Ghandeharizadeh has 
suggested [4] that file placement be adjusted based 
on access history to take advantage of ZCAV ef- 
fects, but no work has measured the effects directly. 


*This research was sponsored by the Advanced Research 
Projects Agency under Contract No. DABT63-93-C-0062. 
Views and conclusions contained in this report are the au- 
thors’ and should not be interpreted as representing the offi- 
cial opinion or policies, either expressed or implied, of ARPA, 
the U.S. Government, or any person or agency connected with 
them. 


The Microsoft Tiger Video Server [2] uses a sim- 
ple placement algorithm in which primary data is 
placed on outer tracks and secondary (redundant, 
infrequently-accessed) data is placed on inner tracks. 

We estimated and measured the effect of ZCAV 
on file system performance on a BSD fast file sys- 
tem, showing a drop of roughly 25% in peak trans- 
fer rate depending on head position. We also show 
that, while ZCAV effects cannot be ignored, a simple 
linear model adequately estimates the performance 
from the few parameters normally available in disk 
drive spec sheets. 

The rest of the paper is organized as follows. 
ZCAV is explained in detail in section 2. In section 3 
we extract the zoning information for the disk drive 
used in our experimental analysis. In the follow- 
ing section an analytic model for estimating ZCAV 
drive performance is presented. Then, our experi- 
mental setup is described, followed by our measured 
results and conclusions. 


2 Zoned Constant Angular Velocity


Magnetic disk drives consist of one or more rotat- 
ing platters on a common spindle. Data is written 
and read by magnetic heads, generally one per sur- 
face (often with a spare surface, so that the num- 
ber of heads is one less than twice the number of 
platters). A track is a concentric circle on one surface.
The collection of tracks at the same distance
from the spindle on each surface constitutes a cylinder.
A track consists of a number of sectors (occasionally
called blocks), the smallest unit of data that
can be read or written by the drive (typically 512
or 1024 bytes, but theoretically any number). The
triple <cylinder, head, sector> uniquely defines a
location on the drive. See [10, 13] for good introductions
to disk architecture.

ZCAV is a technique adopted by hard disk man- 
ufacturers to increase the capacity of disk drives. 
Outer tracks, which are longer, contain more sectors
than the shorter inner tracks. The cylinders are
grouped into zones that all have the same number 
of sectors per track. Some manufacturers refer to 
this as Zoned Bit Recording, ZBR. It is referred to 
as notches or a notched drive in the Small Computer 
Systems Interface (SCSI) specification [1]. 


As a side effect of this, since the time per rota- 
tion is constant, the number of sectors read per sec- 
ond (and hence the transfer rate) is higher on outer 
tracks. The read and write electronics must be able
to keep up with the higher data rates required. 


Compact disks (hence, CD-ROM) and old 400KB 
and 800KB Macintosh floppy drives achieved sim- 
ilar increases in density by varying the rotation 
speed to achieve constant linear velocity. For high- 
performance hard disk drives this is impractical, 
since each seek also means fighting high angular mo- 
mentum to reach the correct speed, increasing the 
latency on seeks to an unacceptable level. 


As table 1 shows¹, the transfer rate of the outer
zones of current disk drives from a major manufac- 
turer exceeds that of the inner zones by factors rang- 
ing from 1.45 to 1.9. It is interesting to note that 
the disks with the highest capacity are not necessarily
those with the highest ratio of outer to inner
transfer rate.


The ST31200, for example, falls off from 47.2 to 
26.8 Mbps, a drop of 43%. Thus, if the disk is op- 
erating mainly in the inner regions of the disk, per- 
formance can be expected to fall to just over half 
of the peak rate. Although not generally stated in 
the user manuals for the disk drives, empirical evi- 
dence indicates that the lower-numbered blocks (for 
a SCSI command set interface) are stored on the 
outer tracks. 


Note that these transfer rates are internal pre- 
format transfer rates; we will use this information 
to calculate the user data rate in the next section. 


Manufacturers sometimes report an “average” 
number of sectors per track for ZCAV disk drives. 
This number appears to be arrived at by totalling 
the number of sectors in the drive and dividing by 
the number of tracks. It does not attempt to reflect 
the fact that a higher percentage of the sectors are 
in tracks with more sectors. This average is useful 
for filling in the BSD disk format information (see 
the manual pages for fs and newfs), which retains 
the cylinder, head, sector model. 


¹Most of these values were retrieved from Seagate’s web
site (http://www.seagate.com), but the availability of data
there varies.


3 Determining Zone Information


SCSI is a commonly used interface for disk drives, 
and all of the drives we deal with in this paper have 
SCSI interfaces. At the SCSI command level, sectors 
are referred to by a logical block address, which the 
device controller maps to a physical location. 

Some information about the disk geometry is of- 
ten available through the MODE SENSE Notch and 
Partition Page on SCSI disk drives. This page 
reports the number of notches. The two drives used 
for this paper, the ST31200 and the ST11200, both 
report 23 notches in this page. On some drives 
it is possible to read some information about each 
zone using MODE SELECT and MODE SENSE. However, 
not all drives implement this functionality. The 
ST31200 supports this, but the ST11200 does not. 
The ST31200 only reports the number of cylinders 
in a zone, however, not the number of sectors per 
track or the total number of sectors in the zone. 

More detailed information can be obtained by 
using SEND DIAGNOSTIC and RECEIVE DIAGNOSTIC 
with the TRANSLATE ADDRESS page. This provides 
the cylinder, head and sector number for each logi- 
cal block, allowing easy determination of the number 
of sectors on a track, as well as two other important 
performance factors: the delay incurred by switching 
tracks and by switching cylinders, measured in sec- 
tors. It is interesting to note that on a Sparc 20/51 
each address translation takes roughly 50 millisec- 
onds, clearly at least one order of magnitude more 
than the actual translation requires. The reason for 
this delay is currently unknown. 
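
Once the per-block translations are in hand, recovering the zone layout is mostly bookkeeping. The sketch below is one hypothetical way to do it in Python, not the modified scsiinfo used for this paper. It assumes the <cylinder, head, sector> triples have already been collected into a list indexed by logical block address; the helper names (sectors_per_track, find_zones, synthetic_map) and the toy geometry are invented for illustration.

```python
from collections import Counter


def sectors_per_track(chs_by_lba):
    """Count the logical blocks that map to each (cylinder, head) track."""
    counts = Counter()
    for cyl, head, _sector in chs_by_lba:
        counts[(cyl, head)] += 1
    return counts


def find_zones(chs_by_lba):
    """Group consecutive cylinders that share a sectors-per-track count.

    Returns a list of (start_cylinder, n_cylinders, sectors_per_track).
    Spared or remapped sectors would perturb the counts, so this sketch
    takes the most common count among a cylinder's tracks.
    """
    per_cyl = {}
    for (cyl, _head), n in sectors_per_track(chs_by_lba).items():
        per_cyl.setdefault(cyl, Counter())[n] += 1

    zones = []
    for cyl in sorted(per_cyl):
        spt = per_cyl[cyl].most_common(1)[0][0]
        if zones and zones[-1][2] == spt and zones[-1][0] + zones[-1][1] == cyl:
            start, ncyl, _ = zones[-1]
            zones[-1] = (start, ncyl + 1, spt)  # extend the current zone
        else:
            zones.append((cyl, 1, spt))  # start a new zone
    return zones


def synthetic_map(layout, heads):
    """Build a toy LBA -> (cyl, head, sector) list for testing.

    layout is a list of (n_cylinders, sectors_per_track) pairs,
    outermost zone first. Purely hypothetical data.
    """
    out, cyl = [], 0
    for ncyl, spt in layout:
        for _ in range(ncyl):
            for head in range(heads):
                out.extend((cyl, head, s) for s in range(spt))
            cyl += 1
    return out


if __name__ == "__main__":
    toy = synthetic_map([(3, 8), (2, 6), (4, 5)], heads=2)
    for start, ncyl, spt in find_zones(toy):
        print(f"zone at cylinder {start}: {ncyl} cylinders, {spt} sectors/track")
```

Because the count is taken per track, track and cylinder skew do not affect it; they shift where sector 0 sits on a track, not how many sectors the track holds.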

The intratrack instantaneous transfer rate can be 
determined by multiplying the number of bytes per 
track by the revolutions per second, 


$$\frac{\text{bytes}}{\text{track}} \times \frac{\text{revs}}{\text{second}}$$

To find the sustained user rate for long transfers,
this must be multiplied by the factor

$$\frac{h \cdot s}{h \cdot s + (h - 1) \cdot g_t + g_c}$$

where $h$ is the number of heads (tracks per cylinder),
$s$ is the sectors per track, $g_t$ is the track-switch
skew (gap) (measured in sectors) and $g_c$ is the
cylinder-switch skew (also in sectors). When reading
continuously, the drive executes $h - 1$ track switches
plus one cylinder switch per cylinder read.
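
As a rough illustration of the two formulas above, the following Python sketch computes the intratrack rate and the skew-adjusted sustained rate. The rotation speed (5,411 rpm), head count (15) and mean of 73 sectors per track are the figures quoted for the ST11200 in section 5.1. The 512-byte sector size and the skew values of 10 and 20 sectors are assumptions chosen only for illustration, not measured values.

```python
def intratrack_rate(sectors_per_track, rpm, bytes_per_sector=512):
    """Instantaneous transfer rate within one track, in bytes/second."""
    return sectors_per_track * bytes_per_sector * rpm / 60.0


def skew_factor(heads, sectors_per_track, track_skew, cyl_skew):
    """Fraction of rotation time spent transferring data when reading
    whole cylinders: h*s / (h*s + (h-1)*g_t + g_c)."""
    data = heads * sectors_per_track
    return data / (data + (heads - 1) * track_skew + cyl_skew)


def sustained_rate(heads, sectors_per_track, track_skew, cyl_skew, rpm,
                   bytes_per_sector=512):
    """Sustained user data rate for long sequential transfers."""
    return (intratrack_rate(sectors_per_track, rpm, bytes_per_sector)
            * skew_factor(heads, sectors_per_track, track_skew, cyl_skew))


if __name__ == "__main__":
    # Skews of 10 and 20 sectors are illustrative assumptions.
    raw = intratrack_rate(73, 5411)
    adj = sustained_rate(15, 73, track_skew=10, cyl_skew=20, rpm=5411)
    print(f"intratrack: {raw / 1e6:.2f} MB/s, sustained: {adj / 1e6:.2f} MB/s")
```

With zero skew the factor is exactly 1, so the sustained rate can never exceed the intratrack rate.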

Table 2 gives the detailed zone information
for the Seagate ST11200 used in these experiments.
This data was obtained by a modified version
of John DiMarco’s scsiinfo, using a SEND
DIAGNOSTIC/RECEIVE DIAGNOSTIC RESULTS
command pair with the TRANSLATE ADDRESS page
for each block on the disk, then hand-extracting the
zone boundaries. On disks that also support setting
the active notch on the MODE SELECT Notch
and Partition Page, it is possible to more directly
extract the cylinder boundaries. The ST31200, for
example, returns the notch size in cylinders, but not
the total sectors in the notch. Determining the notch
boundaries is also complicated by the track skew,
sparing of sectors, and sector remaps. The capacity
of each zone as listed does not take into account
remapped or skipped sectors.


Table 1: Transfer Rates for a Variety of Seagate Disks

Columns: drive | capacity (GB) | min internal xfer rate (Mbps) | max internal xfer rate (Mbps) | ratio
Drives listed: Barracuda ST11950, Barracuda ST32171, Elite ST43400, Decathlon ST5850A, Hawk 4 ST15230, Hawk 2XL ST31051, Elite ST410800, ST31200, ST11200
(numeric entries not recoverable)


Table 2: Extracted Zone Information for ST11200 with Calculated Transfer Rates

Columns: zone | start | cyls | heads | sec/trk | zonesec | totsec | trotgap | crotgap | MB/sec. | adjMBs
Surviving column (apparently totsec, zones 1 through 20): 289050, 330900, 459240, 504285, 594045, 775485, 822795, 916395, 1007640, 1144440, 1235565, 1281075, 1368675, 1452810, 1605885, 1771425, 1803450, 1947090, 2015970, 2080770
(other entries not recoverable)


In table 2, the first column is zone number, start- 
ing from the outer edge, in accordance with block 
numbering. start is the cylinder number for the start
of the zone. cyls is the number of cylinders in the 
zone. heads, the number of data heads used, is con- 
stant for the whole disk drive. sec/trk is the number 
of sectors per track in the zone. zonesec, the to- 
tal number of sectors in the zone, is the product of 
the prior three columns; totsec is a running total of 
the zonesec column. The columns in the table la- 
beled trotgap and crotgap are the track and cylinder 
skew. MB/sec. is the intratrack transfer rate de- 
termined as above, and adjMBs is the rate adjusted 
by the track and cylinder skew, as above. As the 
table shows, the reduction in transfer rate caused 
by fewer sectors in a zone can sometimes be almost 
completely offset by a reduction in the track skew. 
Compared with the values of 23.2 to 40.6 Mbps inter- 
nal transfer rates cited in the manufacturer’s man- 
ual, the adjusted values are 29% lower, and represent 
reasonable “not to exceed” values for system transfer
rates. It is also worth noting that the drive reports 
23 notches on the notch and partition page, but only 
twenty were discernible from the logical-to-physical
block map. The transfer rate is graphed in figure 1, 
which is explained in detail in section 4. 


4 Analytic Approach to Estimating Performance


In this section, we consider three abstract examples,
then analyze the disk drive used for the experiments.
Transfer rates here are quoted in sectors
per revolution; multiplying by revolutions per
second and bytes per sector (both constants) would 
give bytes/second. 

The first example is a hypothetical three-zoned 
disk drive. The outer zone is 100 tracks of 175 sectors,
the middle zone is 100 tracks of 137 sectors, and
the inner zone is 100 tracks of 100 sectors. This is
overly simplistic, but the ratios are common. The total
capacity of the drive is 100*175 + 100*137 + 100*100
= 41200 sectors. Roughly 42% of the sectors are
in the outer zone, 33% in the middle zone, and 24% 
in the inner zone. Figure 2 shows the transfer rate 
in each zone versus track number. Figure 3 plots the 
transfer rate versus block number. Note the differ- 
ent position of the boundary between zones relative 
to figure 2, due to the higher capacity of the outer 
zones. 

The “average” number of sectors per track is 
137. If we assume that each sector is accessed 
with equal frequency, the “average” transfer rate is 
(17500 * 175 + 13700 * 137 + 10000 * 100)/41200, or 
144 sectors/revolution, due to the higher probabil- 
ity of being in a high-sectors-per-track zone. This 
effect alone leads to an error of 5% when estimat- 
ing performance based solely on the mean number 
of sectors per track. 
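
The arithmetic for this three-zone example can be reproduced with a few lines of Python. This is only a sketch of the calculation described above; the function names are ours, and the zone list is the hypothetical drive from the text.

```python
# Three-zone example from the text: (tracks, sectors per track), outer zone first.
ZONES = [(100, 175), (100, 137), (100, 100)]


def capacity(zones):
    """Total sectors on the drive."""
    return sum(tracks * spt for tracks, spt in zones)


def zone_shares(zones):
    """Fraction of all sectors that lies in each zone."""
    total = capacity(zones)
    return [tracks * spt / total for tracks, spt in zones]


def mean_sectors_per_track(zones):
    """The manufacturer-style "average": total sectors / total tracks."""
    return capacity(zones) / sum(tracks for tracks, _ in zones)


def access_weighted_rate(zones):
    """Mean sectors/revolution when every sector is equally likely to be
    accessed: each of the tracks*spt sectors in a zone transfers at spt."""
    return sum(tracks * spt * spt for tracks, spt in zones) / capacity(zones)


if __name__ == "__main__":
    print("capacity:", capacity(ZONES))
    print("shares:", [f"{share:.1%}" for share in zone_shares(ZONES)])
    print("mean sectors/track:", round(mean_sectors_per_track(ZONES), 1))
    print("access-weighted rate:", round(access_weighted_rate(ZONES), 1))
```

The gap between the last two values is the roughly 5% error discussed above: weighting by sectors rather than by tracks favors the outer zones.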

As a second example, consider a more finely- 
grained zoning. Let the disk drive consist of one 
track of 175 sectors, one of 174 sectors, etc. down to 
an inner track of 100 sectors, for a total capacity $C$
of

$$C = \sum_{i \in Z} s_i \cdot t_i$$

where $s_i$ is the number of sectors per track in zone
$i$, $t_i$ is the number of tracks in the zone, and $Z$ is the
set of zones. In this case, $s_i = 175 - i$ and $t_i = 1$, so
this reduces to

$$C = \sum_{i=100}^{175} i = 10{,}450$$

sectors. The mean number of sectors per track is
10,450/76 = 137.5.

Figure 4 shows transfer rate versus track number. 
Figure 5 shows the transfer rate versus block ad- 
dress for this example. Visually, it is nearly linear. 
A small $n^2$ factor would be expected to cause the
transfer rate to fall off more quickly at higher block 
numbers (fewer sectors per track mean fewer sectors 
per zone, meaning the advance to yet-smaller zones 
accelerates), as shown in figure 5. However, this fac- 
tor appears to be unimportant, to first order. 

The median transfer rate $R_{med}$ is the transfer rate
of block number $C/2$. In this case, block 5225 is on
the track with 143 sectors, so it has a transfer rate of 143, 4% higher
than the mean sectors per track.
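
The median lookup for this single-track-zone example can be sketched as follows. The helper names are ours, and blocks are assumed to be numbered from 0 at the outer edge, consistent with the block numbering described in section 3.

```python
def single_track_zones(outer=175, inner=100):
    """One track per zone, from `outer` sectors down to `inner` sectors."""
    return [(1, spt) for spt in range(outer, inner - 1, -1)]


def capacity(zones):
    """Total sectors over all (tracks, sectors_per_track) zones."""
    return sum(tracks * spt for tracks, spt in zones)


def median_rate(zones):
    """Sectors per track of the zone holding block C/2 (blocks numbered
    from 0 at the outer edge)."""
    target = capacity(zones) // 2
    first_block = 0
    for tracks, spt in zones:
        zone_blocks = tracks * spt
        if target < first_block + zone_blocks:
            return spt
        first_block += zone_blocks
    raise ValueError("empty zone list")


if __name__ == "__main__":
    zones = single_track_zones()
    total = capacity(zones)
    print("capacity:", total)
    print("mean sectors/track:", total / len(zones))
    print("median rate (sectors/rev):", median_rate(zones))
```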

USENIX Association 1997 Annual Technical Conference 23

[Figure 3: Three large zones, transfer rate v. block no. Plot title: Transfer Rate as a Function of Block Address. Axes: Block Address v. transfer rate.]

[Figure 4: Single-track Zones, transfer rate v. track no. Plot title: Transfer Rate as a Function of Track Number. Axes: Track Number v. transfer rate.]

[Figure 5: Single-track Zones, transfer rate v. block no. Plot title: Transfer Rate as a Function of Block Address. Axes: Block Address v. Transfer Rate. Series: transfer rate, linear estimate.]

The “average” transfer rate, again assuming equal
probability of access for each sector, would be the
sum of the transfer rates for each of the individual
blocks, divided by the total number of blocks:

$$R_{avg} = \frac{\sum_{i \in Z} s_i^2 \cdot t_i}{C} = \frac{\sum_{i \in Z} s_i^2 \cdot t_i}{\sum_{i \in Z} s_i \cdot t_i}$$

This will not be equal to the “average” sectors 
per track reported by the manufacturer, times the 
rotations per second: 


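$$\frac{\text{avg. bytes}}{\text{second}} \neq \frac{\text{bytes}}{\text{sector}} \times \frac{\text{avg. sectors}}{\text{rotation}} \times \frac{\text{rotations}}{\text{second}}$$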


In our example 2, $R_{avg}$ simplifies to $\sum_{i=100}^{175} i^2 / C$ =
(1,801,800 - 328,350)/C = 1,473,450/10,450 =
141, a very modest 2.5% increase from simply assuming
it to be the mean of the max and min transfer
rates. For all practical purposes, therefore, we
can estimate the mean transfer rate as the mean of 
min and max, when the zoning is fine-grained and 
roughly linear with track number. 
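
The two intermediate values in the example 2 calculation can be checked with the closed form for a sum of squares, $\sum_{i=1}^{n} i^2 = n(n+1)(2n+1)/6$. This is a standard identity that the text does not state explicitly, so the derivation below is our reconstruction of the step:

$$\sum_{i=100}^{175} i^2 = \sum_{i=1}^{175} i^2 - \sum_{i=1}^{99} i^2 = \frac{175 \cdot 176 \cdot 351}{6} - \frac{99 \cdot 100 \cdot 199}{6} = 1{,}801{,}800 - 328{,}350 = 1{,}473{,}450$$

$$R_{avg} = \frac{1{,}473{,}450}{10{,}450} = 141, \qquad \frac{141}{137.5} \approx 1.025$$

The second line is the 2.5% increase over the mean of the maximum and minimum rates, (175 + 100)/2 = 137.5.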

Figure 6 shows transfer rate versus block address 
for the Seagate ST31200, calculated based on the ex- 
tracted zone information. It clearly shows the effects 
of more of the blocks being in the outer zones. The 
median transfer rate is 3.6 MB/sec, 10% higher than 


the 3.25 arrived at by averaging the max and min 
rates. The average transfer rate, assuming each sec- 
tor has equal probability of being accessed, is 3.47, 
still 6% higher than 3.25. Again, the curve varies 
only slightly from linear. 

Figure 1 shows the transfer rate plotted against 
sector address for the ST11200 used for these ex- 
periments. A simple linear estimate is also plotted, 
running from the transfer rate at the outermost zone 
to the innermost zone. This shows a rough fit, with
the maximum error from the true rate being approximately
8%. Thus, while far from perfect, this exceedingly
simple model is significantly more accurate
than assuming a fixed transfer rate, which may
vary by 40%. In addition, this can be easily esti- 
mated from the data sheets typically supplied with 
disk drives. 

A recommended first-order estimate of trans- 
fer rate, simple enough to be implemented in a 
guaranteed-I/O-rate file system, would therefore be 


$$R(x) = 0.7 \left( r_{max} - \frac{x \, (r_{max} - r_{min})}{C} \right) \qquad (1)$$

where $x$ is the block address, $C$ is the disk capacity and $r_{max}$ and $r_{min}$
are the maximum and minimum internal transfer
rates reported by the disk drive manufacturer. The
factor 0.7 comes from our observation in section 3
that adjusting transfer rates for sector overhead, error
correction and track and cylinder skew results
in a drop of approximately 29% from the manufacturer’s
listed transfer rates, which are instantaneous
bit rates at the read/write head. Because this data
is readily available, this factor can be incorporated
quickly and easily by file system and device driver
designers, without the necessity of tediously testing
each possible disk drive. This transfer rate, of
course, must be adjusted by the system’s ability to
sustain the I/O rate; as shown in section 5, for a Sparc
10 running SunOS and an FFS, reads can run at device
speeds, while writes run at approximately 80%
of theoretical.

Figure 6: ST31200 Zones with Calculated Transfer Rates
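
A minimal Python sketch of the linear estimate in equation 1 follows. It assumes the estimate interpolates from the maximum rate at block 0 (the outer edge) to the minimum rate at the last block and then applies the 0.7 factor. The function name and the capacity of 2,080,770 sectors used in the example are assumptions made for illustration; the 23.2 and 40.6 Mbps figures are the manufacturer's rates for the ST11200 quoted in section 3.

```python
def linear_rate_estimate(block, capacity, r_min, r_max, overhead=0.7):
    """First-order user transfer rate at a given block address.

    Interpolates linearly from r_max at block 0 (outer edge) to r_min at
    block `capacity` (inner edge), then scales by the overhead factor.
    The result is in the same units as r_min and r_max.
    """
    if not 0 <= block <= capacity:
        raise ValueError("block address outside the disk")
    return overhead * (r_max - block * (r_max - r_min) / capacity)


if __name__ == "__main__":
    capacity = 2_080_770  # assumed total sectors, for illustration only
    for block in (0, capacity // 2, capacity):
        mbps = linear_rate_estimate(block, capacity, r_min=23.2, r_max=40.6)
        # Convert megabits/second to megabytes/second for comparison
        # with the MB/sec. figures used elsewhere in the text.
        print(f"block {block:>9}: {mbps / 8:.2f} MB/s")
```

Only three numbers from a spec sheet are needed (capacity and the two internal rates), which is the appeal of this estimate over a full zone table.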


5 Experiment 


5.1 Experimental Setup 


These experiments were conducted on a Sparcsta- 
tion 10 with 64 MB of main memory, and a 1.05 
GB Seagate ST11200N disk drive. The actual band- 
width of this disk drive, as shown in table 2, varies 
from approximately 2.06 to 3.62 MB/sec., a factor 
of 1.75. Any read or write rate that exceeds that has 
clearly been the beneficiary of caching, either the file 
system’s buffer cache or the disk drive’s data block 
cache. According to the manual [11], this disk drive
has 23 zones, or notches, and an average (mean) of
73 sectors per track, 15 heads, 1,872 cylinders for 
a total of 28,080 tracks. The drive rotates at 5,411 
rpm. Write caching at the disk is disabled; all writes 
are synchronous. 


The file system is a SunOS UFS, essentially a BSD 
fast file system [5]. The partition used for these ex- 
periments begins at sector number 655,200 and ex- 
tends 687MB to the end of the disk, as reported by 
dkinfo. Thus, according to table 2, the partition 
starts at a transfer rate of 3.24 MB/sec. and falls to 
2.06, a drop of 36%. Unfortunately, due to hardware 
and disk partitioning limitations, it was not possible 
at the time this experiment was conducted to cover 
the entire span of a disk. 


The basic experiment runs a loop that executes 
a modified version of Tim Bray’s bonnie to write 
a 100MB file, unmount the partition (to clear the 
cache and commit all modified metadata), then read 
the file back. Then the script records the Bonnie 
data file layout, deletes the file, writes a 10MB file 
to the system, and repeats. Thus, we have the results
for 100MB written at 10MB intervals. The free
space falls from approximately 610MB (user space 
available) to 105MB in 50 steps.


5.2 Experimental Data


When measuring the effects of the ZCAV layout on 
file system performance, care must be taken as nu- 
merous other factors can contribute to changes in 
performance. They include: 


• distance from metadata (increased seek times)
• free space fragmentation
• CPU performance and system loading
• buffer cache page replacement performance


Of course, the effect on performance will vary dra- 
matically with the file system structure, which is 
generally operating-system specific; this is covered 
in the following section. 

Our data shows that the write rate varies by a 
factor of 1.33 (2528 KB/sec. v. 1900 KB/sec., a 
drop of 25%) depending on head position, even over 
the limited range of our experiments. Evaluating the 
reads is more difficult due to the high variability, but 
if we choose the means from the same data runs as 
the writes, we see a 23% drop, 3295 KB/sec. v. 2547 
KB/sec. at, respectively, 608 and 270 MB free. 

Figure 7 shows the mean of ten runs² of our
100/10 benchmark. The error bars are 90% confi- 
dence intervals. Writes are also plotted with error 
bars, but they are too small to see at many data 
points. The results clearly show a drop in perfor- 
mance as the disk fills, until with about 280MB free 
space the curve takes a sharp, unexpected upward 
turn. 

The write values are lower than the theoretical 
maximum due to inevitable missed rotations. Since 
the disk is not allowed to cache write data, typically 
at least one rotation must be missed at the end of 
each write request. Additionally, occasionally the 
file system writes some metadata to the drive, re- 
quiring a seek and write, with ensuing missed ro- 
tations. The measured values are fairly consistently 
approximately 80% of the calculated values, indicat- 
ing approximately one missed rotation in five. 

Examining the layout for the data files created by 
the Bonnie benchmark (examined using a modified 
version of Keith Smith’s fsblks utility), as shown in 
figure 8, confirms the hypothesis that transfer rate 
is related to head position, as well as providing an 
explanation for the upward turn near the right-hand 
edge. The 10 MB filler files are getting laid down
with holes between them which go unused until the
disk nears full.

²The runs actually used for these calculations are numbers
6 through 15; the first five represented progressive refinements
of the measurement code and are discarded.

Returning to figure 7, the points labeled calc are 
calculated from run number 9, estimating the perfor- 
mance by integrating the transfer rate at each block 
in the file, using the transfer rates calculated for each 
zone in section 3. It clearly shows the same fea- 
tures (dips and peaks) as the write and read curves. 
The slight difference (approximately 5%) between 
the read curve and the calculated estimate is because 
the calculated estimate does not take into account 
real-world overheads for command processing and 
latency, and CPU time in the kernel and user pro- 
cess. This difference (for both read and write) will 
be system dependent and will have to be determined 
empirically. 
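
The per-block calculation behind the calc points can be sketched in Python as below. This is our illustration, not the code used for the paper: it assumes the file layout is available as a list of (start block, block count) extents, that each block is charged the time it takes at its zone's rate, and that the estimate is total bytes over total time. The zone boundaries and rates in the example are hypothetical.

```python
import bisect


def estimated_file_rate(extents, zone_ends, zone_rates, block_size=512):
    """Estimate the mean transfer rate (bytes/second) for a file.

    extents    -- list of (start_block, n_blocks) runs making up the file
    zone_ends  -- sorted list; zone_ends[i] is the first block past zone i
    zone_rates -- zone_rates[i] is the transfer rate of zone i, bytes/second

    Every block must lie below zone_ends[-1].
    """
    total_bytes = 0
    total_time = 0.0
    for start, count in extents:
        block, end = start, start + count
        while block < end:
            zone = bisect.bisect_right(zone_ends, block)
            run_end = min(end, zone_ends[zone])  # split runs at zone edges
            nbytes = (run_end - block) * block_size
            total_bytes += nbytes
            total_time += nbytes / zone_rates[zone]
            block = run_end
    return total_bytes / total_time


if __name__ == "__main__":
    # Hypothetical two-zone disk: 1000 blocks at 4 MB/s, then 1000 at 2 MB/s.
    zone_ends = [1000, 2000]
    zone_rates = [4.0e6, 2.0e6]
    straddling_file = [(500, 1000)]  # half in each zone
    rate = estimated_file_rate(straddling_file, zone_ends, zone_rates)
    print(f"estimated rate: {rate / 1e6:.2f} MB/s")
```

Summing times rather than averaging rates means a file split evenly across two zones comes out below the arithmetic mean of the two rates, since more time is spent in the slower zone.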

The points labeled lin-est are calculated using the 
linear estimate shown in equation 1. The largest dif- 
ference from the more correctly calculated values is 
7%. This error is significant, but the simplicity of 
this linear estimate (both in ease of determination 
and ease of use) may make it an acceptable sub- 
stitute for detailed zone calculations. The appar- 
ent better agreement of the linear estimate than the 
more realistic calculation above is coincidence; the 
linear estimate slightly underestimates performance 
compared to the non-linear effects of block address 
and geometry as described in section 4. Note that 
near the disk spindle (where the curve in figure 7 
dips at 280,000 KB free), the agreement between 
the block calculation and linear estimate is better, 
as we would expect. 

Reviewing our concerns expressed at the top of 
this section, our data has good repeatability, espe- 
cially on writes. The buffer cache issue has been 
addressed by clearing the cache via remounting the 
partition. Free space fragmentation proved to not be 
a problem. The CPU and other system components 
appear to be up to the task of fully utilizing the disk, 
clearly showing the ZCAV effects we expected. 


6 Conclusions 


6.1 Dependence on File System Structure


One of the interesting aspects of this work is how re- 
peatable the data proved to be, especially for writes. 
This clearly demonstrated that the SunOS file allo- 
cation code depends on the current state only; the 
recent history of file creations and deletions does not 
alter future file system allocation decisions. Note 
that this does not mean that disk fragmentation is 


1997 Annual Technical Conference 




not a general problem, only that once large areas of
disk have been cleared of files, the reuse of that area
is optimal (or at least predictable).

[Figure: transfer rate (KB/sec) versus block address.]

[Figure: Transfer Rate as a Function of Free Disk Space; horizontal axis: Free Space (KB).]

McVoy showed that a UFS can achieve good write
performance [6].³ The file blocks are allocated contiguously
and I/Os are performed in clusters, much
like an extent-based file system. Our measurements
benefit from this earlier work.

Other possible file system structures, such as 
SGI’s XFS [12], may depend on more dynamic, and 
hence complex, data structures, and may therefore 
not allocate blocks as predictably. A log-based file 
system [9] or disk device [3] clearly will not, in their
present forms, allocate blocks in a fashion amenable 
to improving throughput by careful choice of blocks. 


6.2 Impact on File System Allocation Policies


As proposed by Ghandeharizadeh [4], the idea of 
incorporating a measure of ZCAV effects into a dynamic
file relocater is appealing. Such functionality could 
be included in a file system defragmenter, moving 
older, less-frequently-accessed files to lower-transfer- 
rate areas of the disk. 

It is clear that this effect needs to be taken into
account for multimedia file systems and file systems
(such as SGI’s XFS [12] or Rangan’s multimedia
ropes [8]) that provide guaranteed throughput.
However, to date these have all assumed disk
bandwidth is fixed, rather than a function of block
address.

Larger files accessed in large chunks, for which 
transfer rate is likely to be more important, should 
be allocated to blocks at the outer edges (for a Sea- 
gate SCSI drive, the lower-numbered blocks). Small 
files obviously do not need to be placed in a high- 
transfer rate location, as their transfer time will be 
dominated by latency. Large files accessed in small 
I/O requests also will not take good advantage of 
the transfer rate. Determining which files will take 
advantage of this may require cooperation from ap- 
plications, perhaps via some form of hints [7]. 
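The placement rule in the preceding paragraph can be written as a small decision function. This is only a sketch of the policy as stated: the function name and both size thresholds are hypothetical, and a real allocator would need the access-pattern hints discussed above.

```python
def preferred_region(file_kb, typical_io_kb,
                     large_file_kb=1024, large_io_kb=64):
    """Suggest a disk region for a file (thresholds are assumptions).

    Only large files that are also accessed in large requests are likely
    to benefit from the faster outer zones. Small files, and large files
    read in small requests, are dominated by latency.
    """
    if file_kb >= large_file_kb and typical_io_kb >= large_io_kb:
        return "outer"
    return "any"


video = preferred_region(file_kb=100_000, typical_io_kb=256)
mail = preferred_region(file_kb=4, typical_io_kb=4)
database = preferred_region(file_kb=100_000, typical_io_kb=4)
```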

Incorporating knowledge of the drive’s ZCAV nature
into the cleaner for a log-structured file system
may be useful. Data should be packed toward
the spindle, so that the open area for upcoming log
writes will get to use the outer, faster regions of the
disk. This could be expected to improve the write
performance of the LFS by 25% or more, at the expense
of slower reads, in keeping with the LFS philosophy.


³McVoy noted, in fact, that exposing the drive’s variable
geometry to the system will complicate block allocation,
especially in an extent-based FS.


6.3 Future Work


Obviously, we would like to try these experiments on 
a wider range of hardware and software platforms, 
especially different file systems. However, the point 
that performance varies with head position appears 
to have been adequately demonstrated. In particu- 
lar, our work should be repeated with different fam- 
ilies of disk drives from different manufacturers, to 
confirm both the hypothesis that ZCAV effects are 
user-visible, and the effectiveness of our proposed 
linear estimate. 


Ideally, a publicly-available bank of information 
on drive types and zone information should be cre- 
ated. As more drive developers adopt standard 
methods of determining the zone information, of 
course, determining this information at boot time 
or file system configuration time becomes more fea- 
sible. 


6.4 Conclusions 


We have explained the underlying motivations behind
ZCAV disk drives, and demonstrated that ZCAV
does have an effect on file system performance for a
BSD FFS. We have shown that it is possible, though
somewhat tedious, to extract this information from 
at least some disk drives directly. We have proposed 
that a simple linear model relating transfer rate to 
block address should be adequate for most purposes. 
The measurable effect, at 23-25%, is less than the 
physical difference of 36% between the inner and 
outer disk edges of the tested partition, but still too 
large to ignore in performance-critical applications.

The author may be contacted via email at
rdv@isi.edu or rdv@alumni.caltech.edu.


USENIX Association 


Abstract


In an operating system kernel, critical sections of code
must be protected from interruption. This is traditionally 
accomplished by masking the set of interrupts whose 
handlers interfere with the correct operation of the criti- 
cal section. Because it can be expensive to communicate 
with an off-chip interrupt controller, more complex opti- 
mistic techniques for masking interrupts have been pro- 
posed. 

In this paper we present measurements of the 
behavior of the NetBSD 1.2 kernel, and use the 
measurements to explore the space of kernel
synchronization schemes. We show that (a) most critical 
sections are very short, (b) very few are ever 
interrupted, (c) using the traditional synchronization 
technique, the synchronization cost is often higher than 
the time spent in the body of the critical section, and (d) 
under heavy load NetBSD 1.2 can spend 9% to 12% of 
its time in synchronization primitives. 

The simplest scheme we examined, disabling all 
interrupts while in a critical section or interrupt handler, 
can lead to loss of data under heavy load. A more 
complex optimistic scheme functions correctly under 
the heavy workloads we tested and has very low 
overhead (at most 0.3%). Based on our measurements, 
we present a new model that offers the simplicity of the 
traditional scheme with the performance of the 
optimistic schemes. 

Given the relative CPU, memory, and device 
performance of today’s hardware, the newer techniques 
we examined have a much lower synchronization cost 
than the traditional technique. Under heavy load, such 
as that incurred by a web server, a system using these 
newer techniques will have noticeably better 
performance. 


1 Introduction 


Although the BSD kernel is traditionally single- 
threaded, it needs to perform synchronization in the face 
of asynchronous interrupts from devices. In addition, 
while processing an interrupt from a device, the kernel
needs to block delivery of new interrupts from that
device. This is traditionally accomplished by communicating
with an off-chip interrupt controller and masking
interrupts. Historically, the routines that perform this
service are named splintr, where intr is the name of the
interrupt to disable. A special routine, splhigh, is used to
disable all interrupts.

Most, if not all, conventional hardware platforms 
allow device interrupts to be prioritized and assigned 
levels; a pending interrupt from a lower-level device cannot
interrupt the processing of an interrupt from a
higher-level device. This method protects critical 
sections (by disabling interrupts from all devices) and 
device drivers (by disabling interrupts from the same 
device). Interrupts can then be prioritized by their 
frequency and the time required to handle them. 

At Harvard we are in the process of developing a 
new operating system, the VINO kernel [Selt96]. In 
order to explore the space of kernel synchronization
methods, we constructed analytic models of four basic 
interrupt handling schemes. We then used the NetBSD 
1.2 kernel to analyze the performance of each scheme, 
by measuring: 
• how often NetBSD enters critical sections,
• how long critical sections last,
• how often they are interrupted,
• the frequency of different types of interrupts, and
• the frequency with which these interrupts are
delivered.

We found that critical sections are generally very 
short (usually less than four microseconds), that little 
total time is spent in critical sections, and that the 
synchronization overhead of three of the four basic 
schemes we examined is nominal. We also found that 
the overhead of the traditional scheme can be as large as 
12% of CPU time. 

On many system architectures interrupt processing 
is handled by an off-chip programmable interrupt 
controller. On a Pentium PC the time required to access 
the off-chip interrupt controller is quite high, 
approximately three hundred cycles (2.5µs on a
120MHz Pentium). 

In addition to the off-chip interrupt controller, some 
CPUs include an on-chip instruction to mask all 
interrupts. On the Pentium this operation can be 
performed in five to ten cycles, two orders of magnitude
more quickly than communicating with the interrupt
controller. 

If a critical section is short, the cost of going off- 
chip to disable and re-enable interrupts can be much 
greater than the time spent in the body of the critical 
section. For example, if the critical section adds an 
element to a linked list, it may only run for twenty or 
thirty cycles; if the kernel needs to go off-chip to 
synchronize, the time spent in the synchronization code 
is an order of magnitude more than the time spent in the 
body of the critical section. If we use the on-chip 
instructions to disable and re-enable interrupts, the 
synchronization cost is lowered to the point where it is 
less than that of the critical section code. 
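The arithmetic behind this comparison is easy to check. The sketch below uses only the approximate cycle counts quoted in the text (about 300 cycles off-chip, 5 to 10 cycles on-chip, 20 to 30 cycles for a linked-list insertion) and the 120MHz clock; the variable names are ours.

```python
CPU_HZ = 120_000_000  # 120MHz Pentium, as quoted in the text


def cycles_to_us(cycles, hz=CPU_HZ):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles * 1_000_000 / hz


off_chip_us = cycles_to_us(300)  # one access to the interrupt controller
on_chip_us = cycles_to_us(10)    # upper end of the 5-10 cycle range
body_us = cycles_to_us(30)       # upper end of a 20-30 cycle list insert

# Ratio of off-chip synchronization time to critical-section body time.
sync_to_body_ratio = off_chip_us / body_us
```

Even against the longest such critical section, the off-chip cost is ten times the body, while the on-chip cost stays below it.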

Optimistic synchronization techniques are 
particularly well-suited to short-duration critical
sections that are infrequently interrupted. On entry to a 
critical section, instead of disabling interrupts, a global 
flag is set. If an interrupt is delivered while the flag is 
set, interrupts are only then disabled, and the interrupt is 
postponed until the flag is cleared. If no interrupt is 
delivered while the flag is set, the cost of 
synchronization is just the cost of setting and clearing 
the flag. 
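The optimistic idea can be illustrated with a toy model. This is not kernel code and not the implementation measured in this paper; the class and attribute names are hypothetical, and real masking is reduced to a counter so that the cost of each path is visible.

```python
class OptimisticSync:
    """Toy model of optimistic interrupt masking (illustrative only)."""

    def __init__(self):
        self.in_crit_sec = False
        self.interrupts_enabled = True
        self.pending = []         # interrupts postponed by a critical section
        self.handled = []         # interrupts whose handlers have run
        self.mask_operations = 0  # how often real masking was needed

    def crit_sec_enter(self):
        self.in_crit_sec = True   # cheap path: only a flag is set

    def crit_sec_leave(self):
        self.in_crit_sec = False
        if self.pending:
            # Run whatever was postponed, then undo the masking.
            self.handled.extend(self.pending)
            self.pending.clear()
            self.interrupts_enabled = True

    def deliver(self, irq):
        if self.in_crit_sec:
            # Only now pay for real masking, and postpone the handler.
            if self.interrupts_enabled:
                self.interrupts_enabled = False
                self.mask_operations += 1
            self.pending.append(irq)
        else:
            self.handled.append(irq)


k = OptimisticSync()
k.crit_sec_enter()
k.crit_sec_leave()   # uninterrupted: no masking at all
k.crit_sec_enter()
k.deliver("disk")    # interrupted: masking happens once
k.crit_sec_leave()
```

In the model, the uninterrupted critical section costs two flag writes, and masking is paid for only in the interrupted case.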

In this paper we model four schemes, based on two 
variables: where to mask interrupts (on-chip or off-chip) 
and how to mask them (pessimistically or 
optimistically). 

We note that the techniques that use on-chip 
interrupt masking for synchronization may not be usable 
on multiprocessor systems. In symmetric multiprocessor 
systems, any processor may enter a critical section. 
Hence, any processor must be able to disable interrupts 
delivered to all processors. On this type of hardware the 
synchronization scheme must communicate either with 
the other processors (directly or through memory), or 
with an off-processor interrupt controller; in either case, 
it needs to go off-chip. An asymmetric multiprocessor 
(where only one processor runs operating system kernel
code, and hence only one processor receives interrupts 
and enters critical sections) would be able to take 
advantage of the on-chip techniques we propose. 

In the remainder of the paper we examine our four 
schemes in detail and compare their costs. We use our 
results to derive a fifth scheme which synthesizes the 
benefits of the others. 

In Section 2, we discuss related work on 
synchronization. In Section 3, we propose our four 
schemes for handling interrupts. In Section 4 we discuss 
our experimental setup, and then in Section 5 measure 
the overhead of the four schemes. Section 6 examines 
the behavioral correctness of the on-chip schemes we 
propose, measuring the frequency of, and amount of 
time spent in, interrupt handlers. In Section 7 we discuss
our results and propose our new scheme, and conclude
in Section 8. 


2 Related Work 


The interrupt handlers of real-time systems often run 
with interrupts disabled [Stan88], as in the V kernel 
[Berg86]. In this paper we explore similar techniques,
using atomicity for synchronization, although we make 
no assumptions about real-time behavior of the system 
as a whole. 

Synchronization does not require locking; other
techniques have been proposed. For example, non-blocking
synchronization [Herl90, Mass91, Green96],
lock-free data structures [Wing92], and restartable
atomic sequences [Bers92] are novel techniques
proposed for synchronization. Some of these schemes
require special hardware support, such as a compare-and-swap
instruction, which is not available on all
processors (e.g., the MIPS R3000 and the Intel i386).
These techniques are designed to work in environments 
where it is expensive (or impossible) to disable 
interrupts, and work best in environments where 
contention is minimal. The cost and complexity of these 
schemes is higher than the on-chip schemes we propose, 
and our analysis shows that under typical loads, such 
complexity is unnecessary. 

Similarly, Stodolsky et al. [Stod93] proposed 
decreasing the cost of synchronization by taking an 
optimistic approach. When the kernel makes a request to 
raise the global priority level, instead of communicating 
with the interrupt controller, the kernel saves the desired 
priority level in a variable. The critical section is then 
run, and on completion, the former priority level is 
restored. If no interrupts occurred during the critical 
section, the kernel was able to save the (high) cost of 
communicating with the interrupt controller, paying 
instead the (lower) cost of saving and restoring the 
logical priority level. 

If an interrupt is delivered while the kernel is in a 
critical section, the interrupt handler compares the 
logical priority level with the interrupt priority level. If 
the interrupt is higher priority, it is handled 
immediately; if it is lower priority, the system queues 
the interrupt and falls back to the pessimistic scheme, 
setting the actual priority level to match the logical 
level. The cost of setting the global flag is quite low 
compared to the cost of communicating with the 
interrupt controller, so if the kernel is rarely interrupted 
while in a critical section the performance of the 
optimistic technique is superior to that of the standard 
pessimistic technique. In the case of a null RPC 
microbenchmark, they saw a 14% performance
improvement.

Our work begins with the ideas developed by
Stodolsky and adds a second dimension of comparison, 
the performance difference seen when masking 
interrupts on-chip and off-chip. 

Mogul and Ramakrishnan [Mogul96] have 
explored a related issue. Interrupt processing can be 
split across multiple priority levels, a high-priority level 
that responds to the device interrupt and a low-priority 
level that processes the interrupt. When this is the case, 
it is possible for a flood of high priority device interrupts 
to cause the starvation of the low-priority processing. 
Mogul and Ramakrishnan refer to this state as receiver 
livelock. When the system reaches this state, their 
strategy is to disable interrupts and poll for new requests
as the old ones are handled. 

Their techniques are most appropriate in the face of 
unsolicited I/O from network and serial devices. The
kernel has little or no control over how quickly data is
sent to this type of device. Unlike a network device, a
disk device only generates interrupts in response to a
request from the kernel. If the kernel is receiving disk
interrupts more quickly than it can process them, the
kernel can reduce the load by queueing fewer disk
requests. A natural feedback loop is present; if the
kernel is bogged down handling old work, little new
work will be started. 

Mogul and Ramakrishnan’s strategy is related to 
ours. They find that, at times, it is more efficient to 
ignore new interrupts in order to process the ones that 
have already arrived. However, they propose 
dynamically switching between two schemes based on 
the current system load; we propose choosing a single 
scheme for handling all interrupts. Their work is 
compatible with ours; none of the synchronization 
schemes we analyze preclude use of their techniques. 


3 Synchronization Strategies 


The first axis of our strategy space is a comparison of 
optimistic and pessimistic schemes, as studied by Stod- 
olsky et al. The second axis of our space is a comparison 
of off-chip and on-chip interrupt management. This 
gives us four strategies to explore, as seen in Table 1. 

In the following sections we describe each scheme 
in detail, sketch its synchronization functions, and 
discuss its costs and benefits. 


            | Pessimistic | Optimistic
Off-chip    | spl-pessim  | spl-optim
On-chip     | cli-pessim  | cli-optim


Table 1: Synchronization Strategies Explored. On the
x86, NetBSD 1.2 uses spl-optim, the Linux 1.3.0 kernel uses
cli-pessim, and BSD/OS 2.1 uses spl-pessim.


3.1 Spl-Pessim 


The spl-pessim scheme is named after the kernel “set 
priority level” function (spl), which was named after the 
PDP-11 instruction of the same name. When a critical 
section is entered, a subset of interrupts is disabled by
communicating with the off-chip programmable interrupt
controller (PIC).


crit_sec_enter()
    saved_mask = cur_mask
    PIC_mask(all)


crit_sec_leave()
    PIC_mask(saved_mask)


interrupt_handler(int intr_level)
    saved_mask = cur_mask
    PIC_mask(intr <= intr_level)
    handle interrupt
    PIC_mask(saved_mask)


The benefit of this scheme is fine-grained control of 
which interrupts are disabled. The cost is high per- 
critical section overhead. 


3.2 Spl-Optim 

When a critical section is entered a variable is set to the 
logical priority level; the variable is restored on exit 
from the critical section. When an interrupt is delivered,
the logical priority level is checked; if the interrupt has
been masked, the interrupt is queued. The hardware
interrupt mask is then set to be the logical mask (by
communicating with the interrupt controller).


crit_sec_enter()
    in_crit_sec = true


crit_sec_leave()
    if (any queued interrupts)
        handle queued interrupts
    in_crit_sec = false


interrupt_handler(int intr_level)
    if (in_crit_sec
        || intr_level < cur_level)
        queue interrupt
    else
        saved_mask = cur_mask
        PIC_mask(intr <= intr_level)
        handle interrupt
        PIC_mask(saved_mask)
        handle queued interrupts

The benefit of this scheme is low per-critical
section overhead, and fine-grained control over which 
interrupts are disabled. 

The cost of this scheme is communication with the 
off-chip interrupt controller if an interrupt is delivered 
when the kernel is in a critical section, and a higher code 
complexity than the pessimistic schemes. 


3.3 Cli-Pessim 


The cli-pessim scheme is named after the x86 instruction
that clears the interrupt flag, cli. When a critical
section is entered, all interrupts are disabled. This
scheme is structurally similar to the spl-pessim scheme,
but instead of communicating with the PIC, on-chip
instructions are used to disable and enable interrupts.


crit_sec_enter()
    disable_all_interrupts()


crit_sec_leave()
    enable_all_interrupts()


interrupt_handler(int intr_level)
    crit_sec_enter()
    handle interrupt
    crit_sec_leave()


The benefit of this scheme is a low per-critical 
section overhead. The cost is increased risk of dropped 
interrupts (if multiple interrupts are delivered during 
critical sections) and possible delay of high-priority 
interrupts while processing a lower priority interrupt or 
while the kernel is in a critical section.


3.4 Cli-Optim 
When a critical section is entered, a global flag is set; it 
is cleared on exit from the critical section. When an 
interrupt is delivered, interrupts are disabled while the 
interrupt is processed. 


crit_sec_enter()
    in_crit_sec = true


crit_sec_leave(int level)
    if (any queued interrupts)
        handle queued interrupts
    in_crit_sec = false


interrupt_handler(int intr_level)
    if (in_crit_sec)
        queue interrupt
    else
        disable_all_interrupts()
        handle interrupt
        enable_all_interrupts()

This scheme is very similar to the spl-optim
scheme, but, as with the cli-pessim scheme, instead of 
communicating with the PIC, interrupts are disabled and 
enabled through the use of on-chip instructions. 

The benefit of this scheme is low per-critical 
section overhead. Its cost is increased risk of lost 
interrupts if multiple interrupts are delivered during 
critical sections, and possible delay of high-priority 
interrupts while processing a lower priority interrupt or 
critical section. It also has a slightly higher code 
complexity than the cli-pessim scheme. 


3.5 Comparison of Schemes 


While a scheme’s performance is important, we 
must also verify its correctness. Devices are designed to 
queue a small number of pending interrupts for a short 
period of time; if interrupts are disabled for a long 
period of time, it is possible that multiple interrupts will 
be merged together or lost, and data buffers can 
overflow with an accompanying loss of data. In some 
cases the loss of an interrupt is a performance issue, not 
a correctness issue (e.g., TCP/IP packets are 
retransmitted if dropped). However, we need to be sure 
that the system as a whole behaves correctly in the face 
of heavy load. 
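The failure mode described above can be illustrated with a toy model in which the hardware holds a fixed number of pending interrupts while interrupts are disabled and drops the rest. The function name, the one-deep queue, and the arrival times are all assumptions made for the example; real devices and interrupt controllers differ in how they latch and merge interrupts.

```python
def count_lost(arrival_times_us, disabled_from_us, disabled_until_us,
               queue_depth=1):
    """Count interrupts lost while interrupts are disabled (toy model).

    Arrivals inside the disabled window are held until queue_depth are
    pending; further arrivals in the window are counted as lost.
    """
    pending = 0
    lost = 0
    for t in sorted(arrival_times_us):
        if disabled_from_us <= t < disabled_until_us:
            if pending < queue_depth:
                pending += 1
            else:
                lost += 1
    return lost


# Three arrivals during a 10 microsecond window: one is held, two are lost.
burst_lost = count_lost([1, 2, 3], 0, 10)
# Only one arrival falls inside the window, so nothing is lost.
sparse_lost = count_lost([1, 20], 0, 10)
```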

Fortunately, it is not necessary to test all four 
schemes for correctness. The cli-pessim scheme is the 
least responsive to external interrupts; if it behaves 
correctly in the face of heavy load, the other systems can 
do no worse. 

Note that use of a cli scheme does not preclude 
assigning priorities to interrupts. Although all interrupts 
are disabled while the system is in a critical section, 
pending interrupts can still be assigned priority levels, 
and higher-priority interrupts will take precedence over 
lower-priority interrupts when interrupts are once again 
enabled. (In our implementation of the cli-pessim kernel 
we assign the same priorities to interrupts as are used by 
standard NetBSD 1.2, the spl-optim kernel to which we
compare it.) 

The costs and benefits of the four strategies depend 
on a number of components. First, the cost of going off- 
chip to set the priority level needs to be measured, as 
does the cost of disabling interrupts on-chip. Second, 
the frequency with which interrupts are received, and 
the frequency with which an interrupt arrives while the 
kernel is in a critical section, must be measured. We
must also determine the time spent in critical sections in
order to learn whether a cli-pessim kernel will ignore
interrupts for too long a period of time, with too high a 
risk of losing interrupts.


Scheme     | Synchronization Overhead
spl-pessim | (number of critical sections entered) × (off-chip cost)
spl-optim  | (number of critical sections entered) × (flag-setting cost) +
           | (number of critical sections interrupted) × (off-chip cost)
cli-pessim | (number of critical sections entered) × (on-chip cost)
cli-optim  | (number of critical sections entered) × (flag-setting cost) +
           | (number of critical sections interrupted) × (on-chip cost)


Table 2: Synchronization Scheme Overheads. The model describing the overhead associated
with the four schemes discussed. The costs are functions of five variables: critical sections
entered, critical sections interrupted, off-chip cost, on-chip cost, and flag-setting cost.





The overhead of synchronization for each strategy 
can be described by a simple analytic model. The model 
must take into account: 
• number of critical sections entered: number of times
a critical section was entered, per second. This factor
is important when computing the overhead for any of
the schemes.
• number of interrupted critical sections: number of
times a critical section was interrupted, per second.
This is only a factor for the optimistic schemes.
• off-chip cost: the time required to communicate with
the interrupt controller, first raising the priority level,
then lowering it. This is only relevant for the spl
schemes.
• on-chip cost: the time required to disable interrupts
and re-enable them. This cost only applies to the cli
schemes.
• flag-setting cost: the time required to set the variable
holding the mask. This cost is only a factor when
using an optimistic scheme.

The equations that describe the overhead for each 
scheme are found in Table 2. In the following section we 
discuss our experimental setup, and then in Section 5 we 
measure the component costs and derive a total cost for 
each scheme. 
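The four overhead equations translate directly into code. The sketch below is a transcription of the model, with hypothetical function names and made-up input values (costs in nanoseconds, rates per second). It also derives one consequence of the model: the fraction of critical sections that would have to be interrupted before an optimistic scheme costs as much as its pessimistic counterpart.

```python
def spl_pessim(entered, off_chip):
    return entered * off_chip


def spl_optim(entered, interrupted, flag, off_chip):
    return entered * flag + interrupted * off_chip


def cli_pessim(entered, on_chip):
    return entered * on_chip


def cli_optim(entered, interrupted, flag, on_chip):
    return entered * flag + interrupted * on_chip


def break_even_fraction(flag, mask_cost):
    """Interrupted fraction at which optimistic equals pessimistic.

    From entered*flag + interrupted*mask_cost == entered*mask_cost.
    """
    return 1.0 - flag / mask_cost


# Illustrative inputs only: 10,000 critical sections per second, 100 of
# them interrupted, and costs of 2000, 200, and 50 ns.
entered, interrupted = 10_000, 100
off_chip, on_chip, flag = 2000, 200, 50

overhead_ns = {
    "spl-pessim": spl_pessim(entered, off_chip),
    "spl-optim": spl_optim(entered, interrupted, flag, off_chip),
    "cli-pessim": cli_pessim(entered, on_chip),
    "cli-optim": cli_optim(entered, interrupted, flag, on_chip),
}
```

With these made-up costs, spl-optim would remain cheaper than spl-pessim until about 97.5% of critical sections were interrupted.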


4 Experimental Setup 


In this section we discuss the kernels that we con- 
structed and measured, and the hardware platform used. 


4.1 Kernels 


The x86 version of NetBSD 1.2, a derivative of 
4.4BSDLite, uses the spl-optim strategy for priority 
level management. We started with an off-the-shelf copy 
of NetBSD 1.2 for our spl-optim measurements. With
this kernel we were able to measure the frequency with 
which critical sections are interrupted. 

Starting with NetBSD 1.2 as a base, we developed a 
cli-pessim kernel. This kernel disables all interrupts any
time a critical section or interrupt handler is entered, and
enables all interrupts when the critical section or 
interrupt handler finishes. The cli-pessim kernel allowed 
us to accurately measure the length of time spent in 
critical sections, the number of critical sections, and the 
time required to handle each type of interrupt. 


4.2 Hardware 


We ran our tests on an x86 PC with a 120MHz Pentium 
processor and PCI bus. The memory system consisted of 
the on-chip cache (8KB i + 8KB d), a 512KB pipeline 
burst off-chip cache and 64MB of 60ns EDO RAM. 
There was one IDE hard drive attached, a 1080 MB,
5400 RPM Western Digital WDC31000, with a 64KB
buffer and an average seek time of 10ms. The system
included a BusLogic BT946C PCI SCSI controller;
attached to it was a 1033 MB, 5400 RPM Fujitsu
M2694ES disk, with a 10ms average seek time and a
transfer rate of 5MB/second, and a Sony 4x SCSI CD-ROM
drive (model CDU-76S). The Ethernet adapter
was a 10Mbps Western Digital Elite 16 with 16KB of
100ns RAM. The serial ports were buffered (16550 
UART). 

Pentium processors include an on-chip cycle 
counter, which enables very precise timing
measurements. We read the counter at the start and at
the end of each experiment; the difference was 
multiplied by the processor cycle time (8 1/3 
nanoseconds) to obtain the elapsed time. The cost of 
reading the cycle counter is roughly 10 cycles; where 
significant, the timing cost has been subtracted from our 
measurements. We include code to read the cycle 
counter in the Appendix. 


5 Synchronization Overhead 


As shown in Table 2, the synchronization overhead of 
each scheme is a function of five variables: the off-chip 
priority setting cost, the on-chip priority setting cost, the 
flag-setting cost, the number of times a critical section is
entered, and the number of times an interrupt arrives
while the system is in a critical section. In this section 
we measure each component to derive the synchroniza- 
tion overhead for the four schemes. 


5.1 Critical Sections Per Second


We ran four tests to estimate the number of critical 
sections entered per second under heavy load. These 
tests supply the value used for the number of critical 
sections entered variable in the equations above. They 
measure the system under a mixed programming load,
heavy web traffic, and high-speed serial traffic. 

The first two tests are the result of running the 
Modified Andrew Benchmark [Ouster90] on a local file 
system (Andrew local) and an NFS-mounted file system 
(Andrew NFS). The Modified Andrew Benchmark 
consists of creating, copying, and compiling a hierarchy 
of files, and was designed to measure the scalability of 
the Andrew filesystem [How88].


Table 3: Critical Sections Per Second. The mean and
maximum number of critical sections entered, per second,
measured with the cli-pessim kernel. We also measured the
mean critical section duration, which is seen to be very short,
especially relative to the cost of off-chip synchronization.




The third test was running the WebStone 2.0 
benchmark from Silicon Graphics [SGI96]. WebStone
was designed to test the performance of web servers
under varying load. We installed the Apache 1.1.1
server on our test machine, and ran the WebStone clients
(which generate http requests of the server) from a
SparcStation 20. The maximum number of critical
sections per second occurred with 50 client processes;
we report the results from this test. 

The fourth test measured the kernel’s behavior 
while receiving a 64KB file via tip(1) at 115,200 bps. 
This test measures the behavior of the kernel when it is 
presented with a high rate of short-duration interrupts.

We instrumented the kernel to measure the duration
of each critical section, using the Pentium cycle counter.
A user-level process polled the kernel once per second
to retrieve the number of critical sections and the time
spent in critical sections since the last poll. We use these
results to derive the numbers in Table 3.


Figure 1: Critical Section Duration Distribution.
Distribution of critical section durations for Andrew
NFS; the other distributions had the same shape.


We plotted the frequency of duration of critical 
sections for each test, and found that the distribution 
roughly follows the shape of an exponential distribution 
(see Figure 1 as an example; the other distributions had 
the same shape). The standard deviation of an 
exponential distribution is equal to the mean of the 
distribution, which in all measured cases is less than 
5µs. If we use the mean as an estimate of the standard
deviation, we find that most critical sections take less
than 10µs; in the case of the Serial 115.2, we expect
most to be less than 12µs.
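To put a number on "most", the cumulative distribution of an exponential can be evaluated directly. The sketch below assumes the durations are exactly exponential, which the measurements only roughly support, and uses a 5µs mean with a 10µs threshold (mean plus one standard deviation) as its illustrative inputs.

```python
import math


def fraction_below(threshold_us, mean_us):
    """P(X < threshold) for an exponential distribution with the given mean."""
    return 1.0 - math.exp(-threshold_us / mean_us)


# Fraction of critical sections shorter than mean + one standard deviation.
p = fraction_below(10.0, 5.0)
```

Under that assumption about 86% of critical sections fall below the threshold, and the fraction is higher when the mean is below 5µs.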


5.2 Interrupted Critical Sections


In order to compute the overhead of the optimistic 
techniques we need to determine the percentage of 
critical sections that are interrupted. We reran the tests
from Table 3 with the standard NetBSD 1.2 kernel
(which uses the spl-optim scheme) and measured the
number of critical sections interrupted per second.
These results are shown in Table 4.

For comparison with the results of Table 3, we
include the Mean Critical Sections Per Second
measured using this kernel; these results are very similar
to the values seen in Table 3. As we expected (based on
the short duration of critical sections), we saw that a 
very small number of critical sections are interrupted. 


5.3 Synchronization Primitive Cost


             | Mean Critical Sections | Interrupted Critical Sections
             | per second (std dev)   | per second (% intr)
Andrew local | 31904 (69%)            | 15 (0.05%)
Andrew NFS   | 25172 (67%)            | 50 (0.2%)
WebStone     | 33890 (33%)            | 300 (0.9%)
Serial 115.2 | 45269 (8%)             | 140 (0.3%)


Table 4: Interrupted Critical Sections. We measured the
number of critical sections entered per second on the spl-optim
kernel, and the number of times per second critical sections
were interrupted (which causes the optimistic kernel to revert
to pessimistic synchronization behavior). We include the
percentage of critical sections interrupted.


The synchronization overhead parameters were
measured by performing each operation pair (set and
unset) in a tight loop 100,000 times and computing the
mean and standard deviation. In all cases the standard
deviation was less than 1% of the mean. The results of
these measurements are given in Table 5.

Because these tests were run in a tight loop, a 
minimal number of cache misses and other side-effects 
are seen in these results, hence the synchronization 
times we use are somewhat optimistic. Cache misses 
could have a large percentage impact on the on-chip and 
flag-setting measurements, because these instruction 
sequences are very short. However, because they are so 
short, even if all of the instructions and data referenced 
by the code cause cache misses, the total absolute 
additional cost (in cycles) would be very small. 





Synchronization Primitive          Time
off-chip (spl-pessim)              2.5µs (304 cycles)
on-chip (cli-pessim)               0.18µs (21 cycles)
set flag (spl-optim, cli-optim)    0.05µs (6 cycles)


Table 5: Synchronization Primitive Cost: Time required to 
set priority level off-chip, on-chip, and to set a global flag 
saving the logical priority level. The Pentium has a write 
buffer, so the flag-setting cost does not include time to write to 
main memory. 
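Table 5 reports each cost both in time and in cycles, so the two columns can be cross-checked against each other. The sketch below is only a consistency check on the reported figures; the clock rate of the test machine is not stated in this section, and the helper name is illustrative.

```python
# (time in microseconds, cycles) as reported in Table 5
primitives = {
    "off-chip (spl-pessim)": (2.5, 304),
    "on-chip (cli-pessim)": (0.18, 21),
    "set flag (optimistic)": (0.05, 6),
}

def implied_mhz(time_us: float, cycles: int) -> float:
    """Cycles per microsecond is numerically the clock rate in MHz."""
    return cycles / time_us

for name, (time_us, cycles) in primitives.items():
    print(f"{name}: {implied_mhz(time_us, cycles):.0f} MHz")
```

All three rows imply a clock rate close to 120MHz, which suggests the time and cycle columns are mutually consistent up to rounding.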





The synchronization cost of each of the four
schemes is directly driven by the costs shown in Table 5.
The off-chip cost is the synchronization cost of the
spl-pessim kernel (304 cycles, 2.5µs), the on-chip cost is
the synchronization cost of the cli-pessim kernel (21
cycles, 0.18µs), and the flag-setting cost is the
synchronization cost of the optimistic kernels (6 cycles,
0.05µs). Comparing the costs in Table 5 with the critical
section durations in Table 3, we immediately see that the
mean critical section duration is less than twice the
off-chip synchronization cost, and that the off-chip cost
is more than three times the mean Serial 115.2 critical
section duration.


5.4 Synchronization Overhead 


Given the results in Table 3, Table 4, and Table 5,
the synchronization cost and overhead for each of the
four schemes can be computed. These costs are given in
Table 6.





                Synchronization Overhead

Scheme        Andrew    Andrew    WebStone    Serial
              local     NFS                   115.2

spl-pessim    9.0%      6.5%      8.6%        11.9%
spl-optim     0.2%      0.1%      0.3%        0.3%
cli-pessim    0.7%      0.5%      0.6%        0.9%
cli-optim     0.2%      0.1%      0.2%        0.2%


Table 6: Synchronization Scheme Overhead. Percentage of 
time spent synchronizing critical sections using the described 
techniques. Computed using equations in Table 2 and 
measurements from Table 3, Table 4, and Table 5. 
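The equations of Table 2 are not reproduced in this section, so the following is only a rough sketch under an assumed model: a pessimistic scheme pays the full primitive cost for every critical section, while an optimistic scheme pays the flag-setting cost for every section plus the pessimistic cost for interrupted sections only. It uses the rates of Table 4 (measured on the spl-optim kernel) rather than those of Table 3, so the results approximate the values in Table 6 without reproducing them exactly. All function names are illustrative.

```python
# Per-pair synchronization costs from Table 5, in microseconds.
OFF_CHIP_US, ON_CHIP_US, FLAG_US = 2.5, 0.18, 0.05

# (critical sections per second, interrupted sections per second), Table 4.
workloads = {
    "Andrew local": (31904, 15),
    "Andrew NFS": (25172, 50),
    "WebStone": (33890, 300),
    "Serial 115.2": (45269, 140),
}

def pessimistic(rate: float, cost_us: float) -> float:
    """Fraction of each second spent in synchronization primitives."""
    return rate * cost_us / 1e6

def optimistic(rate: float, interrupted: float, fallback_us: float) -> float:
    """Flag cost on every section, plus the fallback cost when interrupted."""
    return (rate * FLAG_US + interrupted * fallback_us) / 1e6

for name, (rate, intr) in workloads.items():
    print(f"{name:13s} "
          f"spl-pessim {pessimistic(rate, OFF_CHIP_US):5.1%}  "
          f"cli-pessim {pessimistic(rate, ON_CHIP_US):5.1%}  "
          f"spl-optim {optimistic(rate, intr, OFF_CHIP_US):5.1%}  "
          f"cli-optim {optimistic(rate, intr, ON_CHIP_US):5.1%}")
```

Even this simplified model shows the order-of-magnitude gap between spl-pessim and the other three schemes.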


We see that with the large number of critical 
sections per second seen under the Serial 115.2 test, the 
spl-pessim kernel can spend nearly 12% of its time in 
synchronization primitives. Even under the lightest 
measured load, the system spends more than 6% of its 
time in synchronization primitives. 

Another thing to note is that the other three
techniques all have an overhead that is an order of
magnitude less than that of the spl-pessim technique.
Even though the spl-optim technique goes off-chip for
synchronization when a critical section is interrupted,
so few critical sections are interrupted (from 0.05% to
0.9%) that the difference is negligible (between 0.2%
and 0.7%).


6 Correctness 


With the knowledge that the absolute overhead of the 
three proposed schemes is very low, we must be sure 
that none of the schemes allow interrupts to be lost or 
overly delayed. 

The optimistic schemes do not delay high-priority
interrupts, nor does the spl-pessim scheme; the
cli-pessim scheme is the only one that masks all interrupts
while in a critical section or interrupt handler. If the
cli-pessim scheme can operate correctly, the other schemes
will also operate correctly.

Test              Duration  Total       Interrupts  Mean time to      Min time   Median time  Max time
                  of test   interrupts  per second  handle interrupt  to handle  to handle    to handle
                                                    (std dev)         interrupt  interrupt    interrupt
ide disk          -         -           -           1519µs (7%)       -          -            2003µs
scsi disk         -         -           -           65µs (15%)        -          -            86µs
scsi cd-rom       -         -           -           27µs (7%)         -          -            40µs
floppy            -         -           -           169µs (57%)       -          -            271µs
serial-300 (8K)   -         -           -           5µs (5%)          -          -            12µs
serial-38.4 (64K) -         -           -           27µs (2%)         -          -            42µs
serial-115 (64K)  -         -           -           27µs (4%)         -          -            54µs
ethernet-small    2.4s      1003        -           78µs (25%)        44µs       78µs         370µs
ethernet-large    3.4s      1703        -           487µs (8%)        45µs       491µs        1194µs


Table 7: Behavior of spl-optim kernel during device tests. Frequency of device interrupts and time to handle an interrupt from
each device was measured. Although the spread between minimum and maximum is large, the median in all cases other than
floppy is within 2% of the mean. Each test measured the reception of several thousand interrupts.


In Table 3 we saw that the mean critical section
duration is between 0.6 and 4.6µs. We are concerned
that handling a device interrupt could take much longer,
especially with programmed I/O devices, such as the
IDE disk, which copy data from the device to the kernel
one word at a time.

We are also concerned that high-frequency interrupts
may be overly delayed while the kernel is in a
long-duration interrupt handler. We gathered data on both
long-duration handlers (IDE disk interrupts) and
high-frequency interrupts (serial lines at 115.2Kbps).

We measured the time required to process interrupts 
from a variety of devices attached to our test system. 
Our tests involve reading from or writing to each device 
and measuring the amount of time required to perform 
the test and the number of interrupts delivered while 
performing the test. We ran the following tests: 
• ide disk: write 8MB to IDE disk.
• scsi disk: write 8MB to SCSI disk.
• scsi cd-rom: read 8MB from SCSI CD-ROM.
• serial-300: read 8KB (via tip) at 300bps.
• serial-38.4: read 64KB (via tip) at 38.4Kbps.
• serial-115.2: read 64KB (via tip) at 115.2Kbps.
• floppy: read 1MB from raw floppy disk.
• ether-small: flood ping of 1000 64-byte packets.
• ether-large: flood ping of 1000 1024-byte packets.

The results of these tests are shown in Table 7. The 
mean time to handle an interrupt from a particular 
device is useful, although we found that there can be 
considerable variation. For each test we also include the 
minimum, maximum, and median interrupt handling 
time. 


6.1 IDE, SCSI, and Floppy Interrupts 


As mentioned above, the ide disk test performs 
programmed I/O (PIO), i.e., data is copied by the CPU 
one byte or word at a time between the disk controller 
and main memory. This is very CPU intensive, and can 
be quite slow. By comparison, the SCSI disk controller 
uses direct memory access (DMA) to copy data directly 
from the controller to main memory without the 
intervention of the CPU. Each ide disk interrupt took
substantially longer than a scsi disk interrupt (1519µs
vs. 65µs).

Surprisingly, although the specifications of the IDE
and SCSI disks are very similar, it took nearly three times
as long to transfer 8MB to the SCSI disk as it did to
transfer it to the IDE disk. We believe that this is an
artifact of the NetBSD 1.2 disk drivers; we saw similar
performance with the generic NetBSD 1.2 kernel and
our cli-pessim kernel. We do not believe that this reflects
the underlying performance of the disk or the controller;
under BSD/OS 2.1 the performance of the two disks was
much closer (in fact, the SCSI disk was about 10%
faster than the IDE disk).

The mean time required to process an IDE interrupt 
(1.5ms) is substantially greater than any other interrupt 
processing time we measured. We discuss the possibility 
of timing conflicts between IDE interrupts and other 
interrupts in Section 6.4. 

The standard deviation of floppy interrupt
processing time is quite large (57%). It is caused by the
several different types of interrupts generated by the
floppy device, from seek completion notification (4µs),
to transfer completion notification (25µs), to data
transfer (250µs). The maximum time seen, 271µs, is
short enough that we are not concerned about floppy
interrupts.


6.2 Serial Interrupts 


We see that a serial-300 interrupt takes one third as
long as a serial-38.4 interrupt. We attribute this to the
fact that multiple characters are transferred on each
serial-38.4 interrupt, as evidenced by the fact that 1/8 as
many interrupts were generated per kilobyte by the
latter. (Please note that the serial-300 test transferred
8KB, whereas the other two tests transferred 64KB.)

When we tested a higher-rate serial connection (at
115.2Kbps) we saw roughly the same number of
interrupts as we saw in the serial-38.4 test (8195 vs.
8176) and the same mean interrupt handling time
(27µs), implying the same number of characters
processed per interrupt, but the total test took one third
as long. The high rate of interrupts during the
serial-115.2 test may conflict with the large amount of time
required to process an ide-disk interrupt; we discuss this
possibility in Section 6.4.
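The number of characters handled per interrupt follows directly from the interrupt counts quoted above. This is a small sketch, assuming that "64KB" means 64 × 1024 bytes and that the two counts correspond to the two 64KB serial tests; the helper name is illustrative.

```python
KB = 1024  # assumed: the paper's "KB" taken as 1024 bytes

def chars_per_interrupt(bytes_transferred: int, interrupts: int) -> float:
    """Average number of characters delivered by each interrupt."""
    return bytes_transferred / interrupts

# Interrupt counts quoted in the text for the two 64KB tests.
for interrupts in (8195, 8176):
    print(f"{interrupts} interrupts -> "
          f"{chars_per_interrupt(64 * KB, interrupts):.2f} chars/interrupt")
```

Both counts work out to about eight characters per interrupt, which matches the 1/8 ratio of interrupts per kilobyte relative to the serial-300 test.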


6.3 Network Interrupts


As stated above, our network tests were taken using 
a 10Mbps Ethernet board. We ran three tests: flood 
(continuous) ping, UDP, and TCP. We saw no significant 
variation in interrupt processing time as a function of 
protocol (ping/IP, UDP, and TCP), and hence include 
only the ping/IP results. 

Interrupt processing time is independent of protocol 
because the higher-level protocol processing (where we 
would see differences between TCP, UDP, and ping) 
does not take place when the packet arrives; instead, the 
packet is queued for processing by a higher-level “soft”
interrupt handler which runs once the hardware interrupt 
has completed. 

We ran additional tests with varying packet sizes. 
We saw the expected relationship between packet size 
and processing time (which more or less scaled linearly 
with packet size). 
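As an illustration of the roughly linear relationship, the two Ethernet rows of Table 7 can be used to estimate a fixed and a per-byte component of the handling time. This is a sketch only: it fits a line through just two points, uses the ping sizes of 64 and 1024 bytes as the packet sizes (ignoring headers), and the names are illustrative.

```python
# (packet size in bytes, mean interrupt handling time in microseconds), Table 7
small = (64, 78.0)
large = (1024, 487.0)

def line_through(p, q):
    """Slope and intercept of the straight line through two points."""
    slope = (q[1] - p[1]) / (q[0] - p[0])
    intercept = p[1] - slope * p[0]
    return slope, intercept

per_byte_us, fixed_us = line_through(small, large)
print(f"per-byte cost ~{per_byte_us:.2f} us, fixed cost ~{fixed_us:.0f} us")

def estimate_us(size_bytes: int) -> float:
    """Estimated mean handling time for a packet of the given size."""
    return fixed_us + per_byte_us * size_bytes
```

Under these assumptions each additional byte costs a little under half a microsecond, on top of a fixed cost of roughly 50µs per interrupt.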

Faster network technologies (e.g., 100Mbps 
Ethernet and ATM) would generate interrupts much 
more frequently than our 10Mbps Ethernet interface. 
However, as we see below, the rate at which the serial 
device generates interrupts is sufficient to preclude use 
of the cli schemes. Based on this, the interrupt 
generation rate of ATM is a moot point. 


6.4 Summary 


When examining the data gathered for each 
interrupt, we found that although there was often a 
significant difference between the minimum and 
maximum, in most cases the standard deviation was 
relatively small, and the median was close to the mean. 

Our results show that in our environment it takes 
little time to handle an interrupt from any of our devices. 
No interrupt seen took more than 2ms to service, and in 
most cases the interrupt processing time was several 
orders of magnitude less than that. 

However, we see from Table 7 that interrupts can
occur very frequently. For example, the serial-115.2
interrupts arrived at 1400Hz (every 700µs). We must
examine the consequences of delaying the processing of
these interrupts.

As was discussed in Section 2, we can partition 
interrupts into two classes: solicited and unsolicited. 
Solicited interrupts come in response to requests from 
the CPU (e.g., in response to a disk request). Unsolicited 
interrupts are generated externally, and are not under the 
control of the system (e.g., serial and network traffic). 
The rate of solicited interrupts is under the control of
the system; when a solicited interrupt is received, the
system spends its time processing the interrupt rather
than generating more requests. The system can thus
control the rate at which solicited interrupts are
generated.

The system does not control the rate at which
unsolicited interrupts are generated. However, the
system can indirectly influence the rate at which they are
generated, either in software (by not sending
acknowledgments for network packets that are dropped
or not processed in time), or in hardware (by use of
serial-line flow control).

To block serial interrupts it is necessary for the 
system to communicate with the device, and to be aware 
that it is necessary to slow the interrupt rate. If serial 
interrupts are overly delayed, the system will not be 
aware of the possibility of overflow, and hence will not 
be able to stem the tide in time. 

From our measurements, the longest that an
interrupt will be delayed is 2ms (the longest time spent
handling an ide-disk interrupt). The serial-115.2 device
receives characters every 70µs (115.2Kbits per sec =
14,400 bytes per sec = one byte every 70µs). This means
that up to 28 characters (2ms / 70µs = 28) could arrive
while an IDE interrupt is being processed. Although
the serial port is buffered, the buffer only holds 16
characters. At this rate, characters could easily be lost
on the serial line while processing an IDE interrupt.
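The overflow argument above can be restated as a short calculation. The sketch below follows the text in counting 8 bits per byte (start and stop bits are ignored) and in rounding the longest handler time to 2ms; the constant and variable names are illustrative.

```python
BITS_PER_SEC = 115_200
BITS_PER_BYTE = 8          # as in the text; start/stop bits are ignored
FIFO_DEPTH = 16            # serial port buffer size given in the text
MAX_HANDLER_US = 2000.0    # longest observed handler (ide disk), about 2ms

byte_interval_us = 1e6 / (BITS_PER_SEC / BITS_PER_BYTE)
arrivals = int(MAX_HANDLER_US // byte_interval_us)
lost = max(0, arrivals - FIFO_DEPTH)

# Longest time interrupts may stay masked before the buffer can overflow.
safe_mask_us = FIFO_DEPTH * byte_interval_us

print(f"one byte every {byte_interval_us:.1f} us")
print(f"{arrivals} bytes arrive in {MAX_HANDLER_US:.0f} us; up to {lost} may be lost")
print(f"buffer absorbs about {safe_mask_us:.0f} us of masking")
```

Read the other way around, a 16-character buffer tolerates only about 1.1ms of masked interrupts at this line rate, which is less than the longest IDE handler time.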

If we could eliminate programmed I/O devices, the
cli schemes would work. However, newer devices with
a high interrupt rate (e.g., ATM network controllers),
combined with the possibility of slow interrupt handlers,
shift the balance against the cli techniques. Although
the cli schemes would work with some hardware
combinations, many common configurations could easily
lead to loss of data.

The failure of the strict cli schemes leads us to
propose a new, hybrid scheme, cli-spl-pessim, which is
described in Section 7. The cli-spl-pessim scheme
combines the low cost of the cli schemes for critical
section synchronization with the interrupt prioritization
of the spl schemes.


7 Analysis and Proposal 


Our results show that under the benchmark workloads 
the absolute cost of synchronization using the optimistic 
techniques is low, less than 0.4%. The traditional spl- 
pessim scheme has a much higher cost, over 6% in all 
tested cases. 

The performance improvement we see is less than 
that seen by Stodolsky et al. with their spl-optim scheme 
(14%). We attribute this in part to the differences 
between our benchmarks and theirs; their test consisted 
of a highly optimized null RPC, which has a single 
critical section. Because the null RPC took very little 
time (2140 cycles), an improvement in critical section 
synchronization had a large performance impact. Under 
the more varied loads that we measured, the kernel 
spends a much lower percentage of its time in 
synchronization code. 

For the cli schemes to work, critical sections and 
interrupt handlers must complete their work quickly. 
These schemes completely disable interrupts during a 
critical section, so a long-running interrupt handler can 
delay the delivery of interrupts for a long period of time, 
increasing the likelihood of data loss. The combination
of devices whose interrupt handlers require a long time
to service (e.g., IDE disks with programmed I/O) with
devices that have low latency requirements (e.g., very
fast serial ports) is a worst-case scenario for the cli
schemes.


7.1 Proposal 


We find the simplicity of the pessim schemes and
the low cost of the cli schemes appealing. This leads us
to propose a fifth scheme, cli-spl-pessim, which
combines the cli technique for critical sections with the
prioritization of the spl technique for interrupt handlers.
In this scheme, the cli and sti instructions are used to
mask interrupts while the kernel is in a critical section;
because critical sections are quite short, this will not
overly delay interrupt delivery. When handling an
interrupt, the kernel communicates with the off-chip
interrupt controller, disabling the delivery of equal and
lower-priority interrupts. Because

• interrupts are delivered much less frequently than
critical sections are entered,

• interrupt handlers take much longer than critical
sections, and

• on our hardware platform, when an interrupt is
delivered it is necessary to communicate with the
off-chip controller,

the additional performance impact of going off-chip will
be minimal and more than made up for by the increase
in system robustness. In the format of Section 3, we
include the following pseudo-code for cli-spl-pessim:


void crit_sec_enter(int level)
    disable all interrupts


void crit_sec_leave(int level)
    enable all interrupts


void interrupt_handler
    prev_level = cur_interrupt_level
    PIC_mask(interrupt_level)
    handle_interrupt
    PIC_mask(prev_level)
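The behavior of this pseudo-code can be explored with a toy model. The sketch below is not kernel code and touches no hardware: the class, its fields, and the `deliverable` helper are hypothetical names introduced here, with a boolean standing in for the CPU interrupt flag (cli/sti) and an integer standing in for the priority level masked at the off-chip controller.

```python
class ToyKernel:
    """Toy model of cli-spl-pessim; not real hardware access."""

    def __init__(self):
        self.interrupts_enabled = True   # models the CPU interrupt flag (cli/sti)
        self.cur_level = 0               # models the level masked at the PIC

    def crit_sec_enter(self):
        self.interrupts_enabled = False  # "cli": cheap, on-chip

    def crit_sec_leave(self):
        self.interrupts_enabled = True   # "sti"

    def deliverable(self, level: int) -> bool:
        """Would an interrupt at this priority level be delivered right now?"""
        return self.interrupts_enabled and level > self.cur_level

    def interrupt_handler(self, level: int, body):
        prev_level = self.cur_level
        self.cur_level = level           # mask equal and lower priorities
        try:
            return body()
        finally:
            self.cur_level = prev_level  # restore the previous mask


k = ToyKernel()
# Inside a level-3 handler, level 5 still gets through; levels 3 and 2 do not.
seen = k.interrupt_handler(
    3, lambda: (k.deliverable(5), k.deliverable(3), k.deliverable(2)))
print(seen)
```

The model makes the two halves of the hybrid visible: critical sections block everything but are short, while handlers block only equal and lower priorities.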


The benefit of this scheme is the low cost of
synchronization in critical sections. In addition, because
we use an spl scheme for synchronizing handlers, a
low-priority handler will be interrupted by the arrival of a
high-priority interrupt.

Because our model includes only the cost of 
synchronization of critical sections, and disregards the 
cost of the synchronization overhead in interrupt 
handlers, using our model the cost of cli-spl-pessim is 
identical to that of the cli-pessim scheme, with (as 
shown in Table 6) sub-1% synchronization overhead for 
the system loads that we studied. 

Because cli-spl-pessim offers the performance
characteristics and simplicity of the cli-pessim scheme
without its incorrect behavior under heavy load, we plan
to use it in the implementation of our new operating
system kernel.


7.2 Related Schemes


Although we have described NetBSD as using the
spl-optim technique, it also supports a cli-optim-style
technique for short-duration interrupt handlers.

The Linux operating system uses a scheme similar
to cli-spl-pessim. The cli and sti instructions are used to
synchronize critical sections (with the concomitant high
performance), and, like NetBSD 1.2, short-duration
interrupt handlers are run with all interrupts disabled.
Long-duration handlers are either run with no protection
from interrupts (which increases their complexity) or
wake a kernel task to handle the processing of the
interrupt, and then return. Although this scheme
efficiently synchronizes critical sections, we believe that
the increased latency and complexity of waking a
separate task argue against use of this technique.


8 Conclusions 


In this paper we have shown that it is worthwhile to
rethink how interrupts and synchronization are handled
on modern hardware.

Our results have driven the design of our new 
kernel, where we use the proposed cli-spl-pessim 
scheme. This scheme provides us with the simplicity of 
the pessimistic schemes we describe, with the low 
overhead of the optimistic schemes. 

Comparing our scheme to those used in earlier
systems, we are reminded of how quickly hardware
changes: the CPU and I/O bus of today are substantially
faster than those available in 1993 [Stod93], and
much, much faster than those available on the VAX or
the PDP-11 [Leff89]. Nevertheless, developers of newer
systems [Cust93] find themselves re-using old, complex
techniques to solve a problem that may no longer exist.


Status and Availability 


All code discussed in this paper (benchmark tests and 
patch files for a cli-pessim NetBSD 1.2) is available at 
http://www.eecs.harvard.edu/~chris.


Porting UNIX* to Windows NT


David G. Korn (dgk@research.att.com)


AT&T Laboratories 
Murray Hill, N. J. 07974 


Abstract 


The Software Engineering Research department at
Murray Hill writes and distributes several widely
used development tools and reusable libraries that are
portable across virtually all UNIX platforms [1]. To
enhance reuse of these tools and libraries, we want to
make them available on systems running Windows
NT [2] and/or Windows 95 [3]. We did not want to
support multiple versions of these libraries, and we
wanted to minimize the amount of conditionally
compiled code.


This paper describes an effort to build a
UNIX interface layer on top of the Windows NT and
Windows 95 operating systems. The goal was to
build an open environment rich enough to be both a
good development environment and a suitable
execution environment. This meant that the
overhead needed to be small enough so that there
was no incentive to program to the native operating
system directly. The openness meant that the
complete facilities of the native operating system
were accessible through this environment.


The result of this effort is a set of libraries, headers,
and utilities that we collectively refer to as UWIN.
UWIN contains nearly all the X/Open Release 4 [4]
headers, interfaces and commands. We discuss
alternative porting strategies, commercial products,
design goals, problems that had to be overcome, and
the current status. Some performance measurements
of the current system are presented here.


1. INTRODUCTION 


The marketplace has dictated the need for software
applications to work on a variety of operating system
platforms. Yet, maintaining separate source code
versions and development environments creates
additional expense and requires more programmer
training.


One way to lower this cost is to use a middleware 
layer that hides the differences among the operating 
systems. The problem with this approach is that it 
forces you to program to a non-standard, and often 
proprietary, interface. In addition, it often limits you 
to the least common denominator of features of the 
different operating systems. 


An alternative is to build a middleware layer based
on existing standards. This has been the approach
followed by IBM with the introduction of
OpenEdition [5] for the MVS operating system, URL
http://www.s390.ibm.com/products/oe.
OpenEdition is X/Open compliant so that a large
collection of existing software can be transported at
little cost.


Windows NT is an operating system developed by
Microsoft to fill the needs of the high-end market. It
is a layered architecture, designed from the ground
up, built around a microkernel that is similar to
Mach [6]. One or more subsystems can reside on top
of the microkernel, which gives Windows NT the
ability to run different logical operating systems
simultaneously. For example, the OS/2 subsystem
allows OS/2 applications to run on Windows NT.
The most important subsystem that runs on Windows
NT is the WIN32 subsystem. The WIN32 subsystem
runs all applications that are written to the WIN32
Application Programming Interface (API) [7]. The
API for the WIN32 subsystem is also provided with
Windows 95, although not all of the functions are
implemented. In most instances binaries compiled
for Windows NT that use the WIN32 API will also
run on Windows 95.


The POSIX subsystem allows applications that are
strictly conforming to the IEEE POSIX 1003.1
operating system standard [8] to run on Windows NT.
Since the POSIX standard contains most of the
standard UNIX system call interface, many UNIX
utilities are simple to port to any POSIX system.
Because most of our tools require only the POSIX
interface, we thought that it would be sufficient to
port them to the POSIX subsystem of Windows NT.
We were wrong, for the reasons described in the next
section.


* UNIX is a registered trademark, licensed exclusively through X/Open Limited.


We investigated alternative strategies that would 
allow us to run programs on both UNIX and 
Windows NT based systems. After looking at all the 
alternatives, we decided to write our own library that 
would make porting to Windows NT and Windows 
95 easy. We spent three months putting together the 
basic framework and getting some tools working. 
Realizing that the task was larger than a one person 
project, we contracted a small development team of 2 
or 3 to do portions of the library, packaging, and 
documentation. This paper will discuss porting 
alternatives, the goals for our library, the issues that 
need to be addressed, and the implementation of our 
POSIX library. Finally, we present some 
performance results and future directions. 


2. ALTERNATIVE STRATEGIES 


Six basic strategies can be employed to port existing 
UNIX based applications to Windows NT. The first 
strategy is to rewrite the code using the WIN32 API. 
This strategy makes sense if there are no 
requirements to continue to run on a UNIX system. 
Otherwise, this strategy will either require two sets 
of source (which will most likely be too expensive to 
maintain) or the use of a WIN32 emulation library 
that runs on UNIX platforms. There are at least two 
vendors that have WIN32 API libraries for UNIX 
systems. We ruled out this approach because of the 
effort to rewrite the code to the WIN32 API and 
because the WIN32 API is more complex than the 
X/Open API. 


The second strategy is to use the Microsoft C library. 
Microsoft supplies a library of routines that are 
similar to their UNIX counterparts. You could then 
make modifications to your application as necessary 
where the calls differ from the UNIX call. This 
strategy has been used by at least one commercial 
UNIX tools vendor to port GNU based tools to 
Windows NT. While this strategy is appropriate for 
some applications, other applications may require 
much work to overcome some subtle differences. In
addition, the resulting code may have a large amount
of conditionally compiled code that is hard to test
and maintain.


A third strategy would be to rewrite the code using a
framework which provides a virtual system interface.
There are several vendors that offer object-oriented
application layer interfaces that encapsulate the
operating system and therefore enable applications to
work on multiple systems. There are three
drawbacks to this approach. First, it requires a
large up-front investment. Second, you will be
locked into the vendors’ libraries and not able to take
advantage of savings that result from competition.
Finally, you will likely be restricted to the
intersection of features available on the underlying
platforms.


A fourth strategy is to port the application to the 
POSIX subsystem of Windows NT. The POSIX 
subsystem can run any strictly conforming IEEE 
POSIX application program. This strategy should 
not require major investment, and any investment 
that you make should increase the portability of your 
application to other POSIX conforming systems. 
Unfortunately, this is not a viable alternative for 
most applications. Microsoft has made the POSIX 
subsystem as useless as possible by making it a 
closed system. There is no way to access 
functionality outside of the 1990 POSIX 1003.1 
standard from within the POSIX subsystem, either at 
the library level or at the command level. Thus, you 
cannot even invoke the Microsoft C compiler from 
within the POSIX subsystem. However, since you 
can invoke POSIX commands from the WIN32 
subsystem, it is possible to port some stand alone 
programs to the POSIX subsystem. For example, we
ported the pax utility, the POSIX 1003.2 [9]
replacement for cpio and tar, to Windows NT,
and it can be invoked from any WIN32 program.
Softway Systems, Inc. (URL http://www.softway.com)
has an agreement with Microsoft to enhance the POSIX
subsystem so that it can achieve POSIX 1003.2
conformance. Softway claims that they will open up the
POSIX subsystem so that it can access WIN32 applications.
Even if the POSIX subsystem on Windows NT is 
opened up, the POSIX subsystem is not available for 
Windows 95. 


The fifth strategy is to use an existing POSIX or
X/Open library that runs in the WIN32 subsystem.
At the time that we began this effort, we were aware
of two vendors that sell such libraries but, as
discussed later, these products were less than
satisfactory. In addition, Steve Chamberlain at
Cygnus has started writing a POSIX interface for
Windows NT and Windows 95 (URL
http://www.cygnus.com/misc/gnu-win32/), but it
appears as if his goals are less ambitious than ours.


A sixth and final strategy would be to write your 
own POSIX library using the WIN32 API. After 
investigating the other alternatives, this is what we 
decided to do. We are convinced that this was the 
best strategy for us, since we believe that it resulted 
in a better implementation than the two commercial 
products described later, and because it eliminates 
the need to pay licensing fees for each copy of each 
product that uses the library. The availability of 
source code makes it possible to provide adequate 
support. 


3. GOALS 


We wanted our software to work with Windows 3.1, 
Windows 95, and Windows NT. A summer student 
wrote a POSIX library for Windows 3.1 and we were 
able to port a number of our tools. However, the 
limited capabilities of Windows 3.1 made it a less 
than desirable platform. We instead focused our 
goals on Windows NT and Windows 95. We 
decided to use only the WIN32 API for our library 
so that the library would work on Windows 95 and 
so that all WIN32 interfaces would be available to 
applications. 


Initially, our goal was to provide the IEEE POSIX.1
interface with a library. This would be sufficient to
run ksh and about eighty utilities that we had
written. It soon became obvious that this wasn’t
enough for many applications. Most real programs
use facilities that are not part of this standard, such as
sockets or IPC.


We needed to provide a character-based terminal
interface so that curses-based applications such as vi
could run. After the initial set of utilities was
running, we wanted to get several socket-based tools
working. Several projects at AT&T that became
interested in using our libraries required the System
V IPC facilities. The S graphics system [10] and
ksh-93 [11] required runtime dynamic linking. As
the project progressed, the need for privileged users,
such as root on UNIX systems, surfaced. We
decided that it was important to have setuid and
setgid capabilities. It soon became clear that we
needed full UNIX functionality and we set our goal
on X/Open Release 4 conformance.


We needed to have a complete set of UNIX
development tools since we didn’t want to get into
the business of rewriting makefiles or changing build
scripts. Most code written at AT&T, including our
own, uses nmake [12] (no relation to the Microsoft
nmake), but we also wanted to be able to support
other make variants. We didn’t want to do manual
configuration on tools that have automatic
configuration scripts.


One important goal that we had from the beginning 
was to not require WIN32 specific changes to the 
source to get it to compile and execute. The reason 
for this is that we wanted to be able to compile and 
execute UNIX programs without having to 
understand their semantics. In addition, we wanted to
limit the number of new interface functions and
environment variables that we had to add to use our
library. It is difficult to manage more than one or
two environment variables when installing a new
package.


Another goal that we had was to provide a robust set 
of utilities with minimal overhead. If utilities written 
to the X/Open API were noticeably slower than the 
same utilities written to the native WIN32 API, then 
they were likely to be rewritten making our library 
unnecessary in the long run. 


A final and important goal was interoperability
with the native Windows NT system. Integration
with the native system not only meant that we could 
use headers and libraries from the native system, but 
that we could pass environment variables and open 
file descriptors to commands written with the native 
system. There couldn’t be two unrelated sets of user 
ids and separate passwords. If write permission were 
disabled from the UNIX system, then there should be 
no way to write the file using facilities in the native 
system and vice versa. 


We have not yet achieved all of our goals, but we
think that we are close. We are in the process of
running the X/Open conformance tests to verify
compliance with the X/Open APIs. The rest of the
paper will discuss some of the issues we needed to 
deal with and our solutions. 


4. PROBLEMS TO SOLVE 


The following problems need to be understood and 
dealt with in porting applications to Windows NT. 
These are some of the issues that need to be 
addressed by POSIX library implementations. 
Section 6 describes how UWIN solved most of these 
problems. 


4.1 Windows NT File Systems


Windows NT supports three different file systems,
called FAT, HPFS, and NTFS. FAT, which stands
for File Allocation Table, is the Windows 95 file
system. It is similar to the DOS file system except
that it allows long file names. There is no
distinction between upper and lower case, although
the case is preserved. HPFS, which stands for High
Performance File System, was designed for OS/2.
NTFS, the native NT File System, is similar to the
Berkeley file system[7]. It allows long file names (up
to 255 characters) and supports both upper and lower
case characters. It stores file names as 16-bit
Unicode names.


The file system namespace in Win32 is hierarchical 
as it is in UNIX and DOS. A pathname can be 
separated by either a / or a \. Like DOS, and 
unlike UNIX, disk drives are specified as a colon 
terminated prefix to the path name, so that the 
pathname c:\home\dgk names the file in directory 
\home\dgk on drive c:. Many UNIX utilities 
expect only / separated names and expect a leading
/ for absolute pathnames. They also expect multiple 
/’s to be treated as a single separator. 


Even though NTFS supports case sensitivity for file 
names, the WIN32 API has no support for case 
sensitivity for directories and minimal support for 
case sensitivity for files, limited to a 
FILE_FLAG_POSIX_SEMANTICS creation flag for 
the CreateFile() function. Certain characters 
such as *, ?, >, |, :, ", and \, cannot be used in 
filenames created or accessed with the WIN32 API. 
The names aux, com1, com2, nul, and
filenames consisting of these names followed by any 
suffix, cannot be created or accessed in any directory 
through the WIN32 API. 
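
A POSIX library has to detect such names before they reach the WIN32
API. The check can be sketched as follows. The function name
is_reserved_name() is hypothetical, the list holds only the four names
given above (real systems reserve more), and the sketch assumes that the
comparison ignores case and that a suffix begins at the first period.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Names listed in the text; real systems reserve more than these four. */
static const char *const reserved[] = { "aux", "com1", "com2", "nul" };

/*
 * Return 1 if a file name component is one of the reserved names,
 * optionally followed by a suffix such as ".c"; otherwise return 0.
 * Assumption: the comparison ignores case.
 */
int is_reserved_name(const char *component)
{
    size_t len = strcspn(component, ".");   /* length up to any suffix */
    size_t i, j;

    for (i = 0; i < sizeof(reserved) / sizeof(reserved[0]); i++) {
        if (strlen(reserved[i]) != len)
            continue;
        for (j = 0; j < len; j++)
            if (tolower((unsigned char)component[j]) != reserved[i][j])
                break;
        if (j == len)
            return 1;
    }
    return 0;
}
```

With this sketch, aux and AUX.c are both rejected, while names such as
auxiliary or null are accepted because they only share a prefix with a
reserved name.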


Because Windows 95 doesn’t support execute 
permission on files, it uses the .exe suffix to decide
whether a file is an executable. Windows NT 
doesn’t require this suffix, but some NT utilities, 
such as the DOS command interpreter, require the 
.exe suffix. 


4.2 Line Delimiters 


Windows NT uses the DOS convention of a two 
character sequence <cr><nl> to signify the end of
each line in a text file. UNIX uses a single <nl> to
signify end of line. The result is that file processing 
is more complex than it is with UNIX. There are 
separate modes for opening a file as text and binary 
with the Microsoft C library. Binary mode treats the 
file as a sequence of bytes. Text mode strips off 
each <cr> in front of each new-line as the file is
read, and inserts a <cr> in front of each <nl> as
the file is written. Because the number of characters
read doesn’t indicate the physical position of the
underlying file, programs that keep track of
characters read and use lseek() are likely not to
work in text mode. Fortunately, many programs that
run on Windows NT do not require the <cr> in
front of each <nl> in order to work. This
difference turned out to be less of a problem than we
had originally expected.


4.3 Handles vs. file descriptors 


The WIN32 API uses handles for almost all objects 
such as files, pipes, sockets, processes, and events, 
and most handles can be duped within a process or 
across process boundaries. Handles can be inherited 
from parent processes. Handles are analogous to file 
descriptors except that they are unordered, so that a 
per process table is needed to maintain the ordering. 
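
The ordering requirement can be illustrated with a small table that is
indexed by file descriptor and always hands out the lowest unused slot,
as open() and dup() must. This is a sketch only: the names fd_table,
fd_alloc(), and fd_release() are hypothetical, handle_t stands in for a
WIN32 HANDLE, and the table size is arbitrary.

```c
#include <stddef.h>

#define FD_MAX 64                 /* illustrative limit, not UWIN's value */

typedef void *handle_t;           /* stand-in for a WIN32 HANDLE */

struct fd_table {
    handle_t slot[FD_MAX];        /* NULL marks an unused descriptor */
};

/* Bind a handle to the lowest unused descriptor, as open() and dup() do. */
int fd_alloc(struct fd_table *t, handle_t h)
{
    int fd;

    for (fd = 0; fd < FD_MAX; fd++) {
        if (t->slot[fd] == NULL) {
            t->slot[fd] = h;
            return fd;
        }
    }
    return -1;                    /* table full */
}

/* Release a descriptor; returns the handle that the caller must close. */
handle_t fd_release(struct fd_table *t, int fd)
{
    handle_t h;

    if (fd < 0 || fd >= FD_MAX)
        return NULL;
    h = t->slot[fd];
    t->slot[fd] = NULL;
    return h;
}
```

Because the handles themselves carry no order, the array index is the
only place where the descriptor number lives.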


Many handles, such as pipe, process, and event 
handles, have a synchronize attribute, and a process 
can wait for a change of state on any or all of an 
array of handles. Unfortunately, socket handles do 
not have this attribute. One of the few novel 
features of WIN32 is the ability to create a handle 
for a directory with the synchronize attribute. This 
handle changes state when any files under that 
directory change. This is how multiple views of a 
directory can be updated correctly in the presence of 
change. 


4.4 Inconsistent Interfaces 


The WIN32 API handle interface is often 
inconsistent. Failures from functions that return
handles return either 0 or -1 depending on the
function. The CloseHandle() function does not 
work with directory handles. The WIN32 API is 
also inconsistent with respect to calls that take 
pathname arguments and calls that take handles. 
Some functions require the pathname and others 
require the handle. In some instances, both calls 
exist, but they behave a little differently. 


4.5 Chop Sticks Only 


The WIN32 subsystem does not have an equivalent 
for fork() or an equivalent for the exec*() 
family. There is a single primitive, named
CreateProcess(), that takes 10 arguments, yet
still cannot perform the simple operation of 
overlaying the current process with a new program 
as execve() requires. 


4.6 Parent/Child Relationships


The WIN32 subsystem does not support parent/child 
relationships between processes. The process that 
calls CreateProcess() can be thought of as the 
parent, but there is no way for a child to determine 
its parent. Most resources, such as files and 
processes, have handles that can be inherited by child 
processes and passed to unrelated processes. Any 
process can wait for another process to complete if it 
has an open handle to that process. There is a 
limited concept of process group that affects the 
distribution of keyboard signals, and a process can be 
placed in a new group at startup or can inherit the 
group of the parent process. There is no way to get 
or set the process group of an existing process. 


4.7 Signals 


The WIN32 API provides a structured mechanism 
for exception handling. Also, signals generated from
within a process are supported by the API. 
However, signals generated by another process have 
no direct method of implementation. In addition to 
being able to interrupt processing at any point, a 
signal handler might perform a longjmp and never
return.


4.8 Ids and Permissions 


Windows NT uses subject identifiers (called security
identifiers, or SIDs, in Microsoft documentation) to
identify users and groups. A subject identifier
consists of an array of numbers that identify the
administrative authority and sub-authorities
associated with a given user. A
UNIX user or group id is a single number that
uniquely identifies a user or group only within a
single system. Information about users is kept in
a registry database which is accessible via the
WIN32 API and the LAN Manager API.


Windows NT uses an access control list, ACL, on 
each file or object to control the access of the file or 
object for each user. UNIX uses a set of permission 
bits associated with the three classes of users: the 
owner of the object, the group that the object belongs 
to, and everyone else. While it is possible to
construct an access control list that more or less 
corresponds to a given UNIX permission, it is not 
always possible to represent a given access control 
list with UNIX permissions. 


Windows NT has separate permissions for writing a 
file, deleting a file, and for changing the permission 
on a file. The write bit on UNIX systems determines
all three. Thus, it is possible to encounter files that 
have partial write capability. 


UNIX processes have real and effective user and
group ids that control access to resources. Windows
NT assigns each process a security token that defines
the set of privileges that it has. UNIX systems use
setuid/setgid to delegate privileges to processes.
Windows NT uses a technique called impersonation
to carry out commands on behalf of a given user.
There is no user that has unlimited privileges as the
root user does with UNIX. Instead, the special
privileges of root have been broken apart into
separate privileges that can be given to one or more
users. One of the biggest challenges we faced was
providing the UNIX model of setuid/setgid on top of
the WIN32 interface.


The implementation of WIN32 for Windows 95 does 
not support the NT security model, and calls return a
"not implemented" error.


4.9 Terminal Interface 


Windows NT and Windows 95 allow each character 
based application to be associated with a console 
which is similar to an xterm window. Consoles 
support echo and no echo mode, and line at a time or 
character at a time input mode, but lack many of the 
other features of the POSIX termios interface.
There is no support for processing escape sequences 
that are sent to the console window. In echo mode, 
characters are echoed to the console when a read call 
is pending, not while they are typed. There are 
separate console handles for reading from the 
keyboard and writing to the screen. 


4.10 Special Files 


The WIN32 API supports unnamed pipes with the 
UNIX semantics. Named pipes are also supported 
but have different semantics than fifos and occupy a 
separate name space. There is no /dev directory to 
name special files such as /dev/tty and 
/dev/null. The WIN32 API does support special
names of the form \\.\PhysicalDrive for disk 
drives and tape drive devices. 


Windows NT supports hard links to files, but there is
no WIN32 API call to create these links. It does
not support symbolic links in the file system directly,
but on Windows 95 and on Windows NT 4.0, the file
browser does support short cuts, which are very
similar to symbolic links.


4.11 Shared libraries 


The WIN32 API supports the linking of shared 
libraries at program invocation and at run time. The 
libraries are called dynamically linked libraries or
DLLs and are represented by two separate files.
One file provides the interface and is needed at
compile time to satisfy external references. The
second file contains the implementation and is needed
at run time.


There are some restrictions on DLLs that are not
found on UNIX system shared library
implementations. One restriction is that you cannot
override a function called by a DLL by providing
your own version of the function. Thus, supplying
your own malloc() and free() functions will
not override the calls to malloc() and free()
made by other DLLs. Secondly, the library can
only contain pointers to data, not data itself. Thus,
making a symbol such as errno part of a DLL is
impossible. Even making symbols such as stdin
point to data in a DLL invites trouble since it is not
possible to compile code that uses

    static FILE *myfile = stdin;


4.12 Compilers and libraries 


Microsoft sells the Visual C/C++ compiler for 
Windows NT and Windows 95. This compiler has 
both a graphical and command line interface. 
Microsoft also sells a software developers kit (SDK) 
that contains tools, including the Microsoft nmake. 
The compiler and linker use a different set of flags 
than standard UNIX compilers, and C files produce 
.obj files by default, rather than .o files. 
Fortunately, the linker can handle both .obj and .o 
files. The linker has options to choose a starting 
address and to specify whether the application is a 
console application, a GUI application, a POSIX 
application, or a dynamically linked library. 


4.13 Environment Variables 


The WIN32 API supports the creation and export of 
environment variables in much the same way that 
UNIX systems do. Some environment variables, 
such as PATH, are used by both WIN32 and by
UNIX, yet have different formats. UNIX uses a : 
separated list of pathnames; WIN32 uses a ; 
separated list. 


5. COMMERCIAL POSIX LIBRARY 
INTERFACES 


We purchased software from the two commercial 
vendors that we were aware of that sell POSIX 
libraries for Windows NT that run under the WIN32 
subsystem. Each offers a software development kit 
containing include files and libraries, and each offers 
a set of UNIX utilities. Both of these vendors 
require a license to use their libraries in products. 
We used earlier versions of their products, but based
on their web pages at the time this paper was
written, the following description still applies. Both
of these vendors supply cc commands that invoke
the underlying Microsoft Visual C/C++ compiler.
Neither of these products supports symbolic links, job
control, or fifos. Both appear to have implemented
the exec*() family incorrectly in that the process
that does the exec does not terminate until the child
process completes. A process that repeatedly execs
itself will eventually cause the operating system to
run out of processes. It is not clear from their home
pages whether either of these products works with
Windows 95.


5.1 NuTCracker from DataFocus 


NuTCracker, by DataFocus, URL
http://www.datafocus.com, makes an
attempt to support UNIX conventions. It maps
Windows NT file names to and from UNIX file
names, and adjusts the PATH environment variable 
accordingly. For example, it maps the Windows NT 
file name d:\bin to the UNIX filename /d=/bin 
and handles the special names /dev/null and 
/dev/tty. The = is a poor choice because the 
POSIX.2 standard for the shell language leaves the 
behavior of commands that have an = in their name 
unspecified. 


NuTCracker ships the MKS Toolkit as the utilities. 
The MKS Toolkit is a completely independent 
implementation that does not use the NuTCracker 
libraries. We view this as a serious deficiency since 
the behavior of the utilities is no guide as to the
correctness or functionality of the NuTCracker 
library. 


The NuTCracker library lacks some functions not 
defined by POSIX or ANSI C that are available on 
UNIX systems such as hsearch() and 
cuserid(). 


In addition to the above deficiencies, NuTCracker 
does not support filename case distinction. 


NuTCracker supports a Motif library for porting X11 
based applications including a version that offers a 
Windows look and feel. 


5.2 Portage from Consensys 


The other product that we purchased is named 
Portage and is sold by Consensys Systems, URL 
http://www.consensys.com. The source is 
based on System V, Release 4, which makes it the 
more suitable for most AT&T products. Their 
utilities were built from the System V source, but it 
was clear that changes were made in order to port 
them to Windows NT.


Portage Version 1.0 does not map Windows NT file
names into UNIX names. They have modified some
tools such as ksh to recognize ; as the PATH 
delimiter in place of :. Version 1.0 did not support 
case distinction, but their home page indicates that 
they now do. 


In terms of functionality, the NuTCracker suite is 
more complete than Portage. 


6. UWIN DESIGN AND 
IMPLEMENTATION 


We started work on writing our own POSIX library 
at the beginning of 1995 after being frustrated with 
the existing commercial products. We were able to 
put together a useful subset of functions in about 3 
months. However, to be successful, it was necessary 
to provide as complete a package as possible. The 
library needed to handle console and serial line 
support, sockets, UNIX permissions, and other 
commonly used mechanisms such as memory 
mapping, IPC, and dynamic linking. In addition, to 
be useful, the libraries had to be documented and 
supported. This put the scope of the project outside 
of the reach of a small research department such as 
ours. 


We subcontracted some of the development to 
AT&T GIS in India to help complete this project. 
We jointly designed the terminal interface and the 
group in India implemented it. They also worked on 
completing the sockets library. They packaged the 
software for installation and are providing
documentation. This section describes the UWIN 
implementation and how we solved many of the 
problems described in Section 4. 


6.1 UWIN Architecture 


The current implementation of UWIN consists of two
dynamically linked libraries named posix.dll and
ast.dll that more or less implement the functions
documented respectively in section 2 and section 3
of UNIX manuals. In addition, a server process
named UMS runs as Administrator (the closest
thing to root). UMS generates security tokens for
setuid/setgid programs as needed. It is also
responsible for keeping the /etc/passwd and
/etc/group files consistent with the registry
database. The architecture for UWIN is illustrated
in Figure 1. The UMS server does not exist for
Windows 95.


The posix.dll library maintains an open file table
that is shared by all the currently active UNIX 
processes in a memory mapped region. This region 
is writable by all processes so that an ill-behaved 
process could affect another process. Even though 
all processes have read and write access to the shared 
segment, secure access to kernel objects in Windows 
NT is not compromised by this model because a 
process must have access rights to an object to use 
it; knowing its address or value doesn’t give 
additional access rights. Some initial measurements 
indicated that the alternative of having a server 
process update the shared memory region would
have had a performance penalty that we did not 
believe was worth the cost. However, this is an area 
for future investigation. 


The open file table is an array of structures of type 
Pfd_t as illustrated in Table 1. 


char type 


TABLE 1. File Table Structure 





The refcount field is used to keep track of free
entries in this table. The Win32
InterlockedIncrement() and
InterlockedDecrement() functions are used
to maintain this count so that concurrent access by
different processes will work correctly. The oflag
field stores the open flags for the file. The type
field indicates the type of file: regular, pipe, socket,
or special file. The function that is used to read from
or write to the file depends on the value of type.
For certain types, the extra field stores an index
into a type-specific table that stores additional
information about this file.
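
One way to use a reference count to claim a free table entry without a
lock is sketched below. It is not the UWIN source: C11 atomics stand in
for the Win32 interlocked functions, the names struct pfd, pfd_claim(),
and pfd_release() are hypothetical, and the types of oflag and extra
and the table size are assumptions.

```c
#include <stdatomic.h>

#define NFILES 8                      /* illustrative table size */

struct pfd {                          /* loosely modeled on Pfd_t */
    atomic_long refcount;             /* 0 means the entry is free */
    int         oflag;                /* open flags (type assumed) */
    char        type;                 /* regular, pipe, socket, ... */
    short       extra;                /* index into a type-specific table */
};

/* Claim a free entry without a lock; returns its index or -1. */
int pfd_claim(struct pfd *tab, int oflag, char type)
{
    int i;

    for (i = 0; i < NFILES; i++) {
        /* The caller that moves the count from 0 to 1 owns the entry. */
        if (atomic_fetch_add(&tab[i].refcount, 1) == 0) {
            tab[i].oflag = oflag;
            tab[i].type = type;
            tab[i].extra = -1;        /* no type-specific data yet */
            return i;
        }
        atomic_fetch_sub(&tab[i].refcount, 1);   /* in use: undo */
    }
    return -1;
}

/* Drop one reference; the entry is free again when the result is 0. */
long pfd_release(struct pfd *tab, int i)
{
    return atomic_fetch_sub(&tab[i].refcount, 1) - 1;
}
```

If two callers race for the same free entry, only one of them sees the
previous count of 0; the other undoes its increment and moves on to the
next entry.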


The posix.dll library also maintains a per 
process structure, Pproc_t. The per process 
structure contains information required by UNIX 
processes that is not required by Win32 processes 
such as parent process id, process group id, signal 
masks, and process state as illustrated in Table 2. 


Like the open file table, the process table maintains a
reference count so that process slots can be allocated
without creating a critical region. The meaning of
most of the fields in the process structure can be
deduced from their names. The Psig_t structure
contains the bit masks for ignored, blocked, and
pending signals. When the first child process is
invoked by a process, a thread is created that waits
for this and subsequent processes to complete. The
waitevent field contains an event that this thread also
waits on so that additional children can be added to
the list of children to wait for.


Figure 1. UWIN Architecture


Pproc_t
    long       refcount
    HANDLE     proc, thread
    HANDLE     sigevent
    HANDLE     waitevent
    HANDLE     etok, rtok
    ulong      ntpid
    pid_t      pid, ppid, pgrp, sid
    Pprocfd_t  fdtab[OPEN_MAX]


TABLE 2. Process Table Structure


The process structure contains an array of up to
OPEN_MAX structures of type Pprocfd_t that is
indexed by file descriptor. The Pprocfd_t
structure contains the close-on-exec bit, the index of
the file in the open file table, and the corresponding
handle or handles as illustrated in Table 3.


The posix.dll library implements the
malloc(), realloc(), and free() interface
using the Vmalloc library written by Kiem-Phong
Vo[14]. The Vmalloc library provides an interface
to walk over all memory segments that are allocated,
which is needed for the fork() implementation
described later.











Pprocfd_t
    short   index
    HANDLE  primary
    HANDLE  secondary


TABLE 3. Process file structure


The ast.dll library provides a portable application
programming interface that is used by all of our 
utilities. The interface to this library is named 
libast.a, for compatibility with its name on 
UNIX systems. libast.a provides C library 
functions that are not present on all systems so that 
application code doesn’t require #ifdefs to handle 
system dependencies. libast.a is built using the 
iffe command[15] to feature test the host system
and determine what interfaces do not exist in the 
native system. 


libast.a relies on the Microsoft C library for
most of the ANSI-C functionality. The most
significant exception to this, other than malloc(),
which is provided by posix.dll, is the stdio
library. libast.a provides its own version of the
stdio library based on calls to Sfio[16]. The
Sfio library makes calls to posix.dll rather than
making direct calls to the WIN32 API as the
Microsoft C library does, so that pathnames are
correctly mapped.


The use of Sfio also provides a simple solution to 
the <cr><nl> problem. When a file is explicitly 
opened for reading as a text file, an Sfio discipline 
for read() and lseek() can be inserted on the 
stream to change all <cr><nl> sequences to <nl>.
The lseek() discipline uses logical offsets so that 
the removal of <cr> characters is transparent. We 
did not provide a discipline to change <nl> to 
<cr><nl> since we discovered that most Windows 95 
and Windows NT utilities worked without the <cr>s. 
The <cr>s could be inserted by a filter such as sed 
if required. 
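
The idea of a read filter with logical offsets can be sketched on an
in-memory buffer. This is not the Sfio discipline interface; the
function names strip_cr() and logical_to_physical() are hypothetical,
and the sketch assumes that only a <cr> directly followed by <nl> is
removed.

```c
#include <stddef.h>

/*
 * Copy src to dst, dropping each <cr> that immediately precedes <nl>.
 * Returns the number of bytes stored in dst (the logical length).
 * dst must have room for n bytes.
 */
size_t strip_cr(const char *src, size_t n, char *dst)
{
    size_t i, out = 0;

    for (i = 0; i < n; i++) {
        if (src[i] == '\r' && i + 1 < n && src[i + 1] == '\n')
            continue;                 /* skip the <cr> of a <cr><nl> pair */
        dst[out++] = src[i];
    }
    return out;
}

/*
 * Translate a logical offset (a position in the stripped text) into
 * the physical offset in the underlying file data.
 */
size_t logical_to_physical(const char *src, size_t n, size_t logical)
{
    size_t i, seen = 0;

    for (i = 0; i < n && seen < logical; i++) {
        if (src[i] == '\r' && i + 1 < n && src[i + 1] == '\n')
            continue;                 /* removed bytes do not count */
        seen++;
    }
    return i;
}
```

A program that counts the characters it has read can then seek to a
logical position, and the filter converts it to the matching position
in the file.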


6.2 Files 


The posix.dll library performs the mapping 
between handles and file descriptors. Usually, each 
file descriptor has one handle associated with it. In 
some cases, two handles may be associated with a 
file descriptor. An example of this is a console that 
is open for reading and writing which uses separate 
handles for reading and writing. 


The posix.dll library handles the mapping
between UNIX pathnames and WIN32 pathnames.
Many UNIX programs assume that pathnames that
do not begin with a / are relative pathnames. In
addition, only / is recognized as a delimiter. There
is only a single root directory; the operation of
changing to another drive does not change the root
directory. The posix.dll library maps all file
names it encounters. If the file name begins with a
/ and the first component is a single letter, then this
letter is taken as the drive letter. Thus, the UNIX
filename /d/bin/date gets translated to
d:\bin\date. The file name mapping routine
also recognizes special file names such as
/dev/tty and /dev/null. A / not followed by
a drive letter is mapped to the drive that UWIN has
been installed on so that programs that embed
absolute pathnames for files in /bin, /tmp, /dev,
and /etc work without modification.
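
The drive letter rule can be sketched as a small function. This is an
illustration rather than the UWIN routine: the name unix_to_win32() and
the root_drive parameter are hypothetical, special names such as
/dev/null are not handled, and a / without a drive letter is assumed to
map to the root of the installation drive.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Map a UNIX-style pathname to a WIN32-style one.
 *   /d/bin/date  ->  d:\bin\date     (single-letter first component)
 *   /bin/ls      ->  <root_drive>:\bin\ls
 *   src//x.c     ->  src\x.c         (relative; repeated / collapsed)
 * Returns 0 on success, -1 if out (at least 4 bytes) is too small.
 */
int unix_to_win32(const char *path, char root_drive, char *out, size_t size)
{
    size_t n = 0;

    if (size < 4)
        return -1;                         /* no room for "x:\" and NUL */
    if (*path == '/') {
        char drive = root_drive;

        while (*path == '/')
            path++;                        /* skip the leading separators */
        if (isalpha((unsigned char)path[0]) &&
            (path[1] == '/' || path[1] == '\0')) {
            drive = path[0];               /* /d/... names drive d: */
            path++;
            while (*path == '/')
                path++;
        }
        out[n++] = drive;
        out[n++] = ':';
        out[n++] = '\\';
    }
    while (*path) {
        if (n + 1 >= size)
            return -1;
        if (*path == '/') {
            out[n++] = '\\';
            while (*path == '/')
                path++;                    /* collapse repeated separators */
        } else {
            out[n++] = *path++;
        }
    }
    out[n] = '\0';
    return 0;
}
```

The rule has an inherent ambiguity that the sketch shares: a file in the
root directory whose name is a single letter cannot be told apart from a
drive.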


Finally, the path search algorithm was modified to 
look for .exe and .bat suffixes.


One problem introduced by the pathname mapping is
that passing file name arguments to native NT
utilities is more difficult since they understand DOS
style names, not UNIX names. A library routine was
added to return a DOS name given a UNIX name.


The posix.dll library pathname mapping function
also takes care of exact case matching on file
systems that require it. One of the most troublesome
aspects of the WIN32 API is its lack of support for
pathname case distinction. It is not uncommon to
have files named Makefile and makefile in the
same directory in UNIX. UWIN handles case
distinction by calling the WIN32 CreateFile()
function both with and without the
FILE_FLAG_POSIX_SEMANTICS flag. If
the results compare equal, it executes the function
internally; otherwise it spawns a POSIX subsystem
process to carry out the task.


6.3 fork/exec 


The fork() system call was implemented by 
creating a new process with the same startup 
information as the current process. Before executing 
main(), it copies the data and stack of the parent 
process into itself. Handles that were closed when 
the new process was created are duplicated into the 
new process. The exec*() family of functions was 
much harder to implement. The problem is that 
there is no way to overlay the calling process. 
Portage and NuTCracker have the current process 
wait for the child process to complete and then exit. 
There are two problems with this approach. First, a 
process that execs repeatedly will fill up the process 
table. More importantly, resources from the parent 
process are not released. Our method causes the 
child process to be reparented to the grandparent and 
the process that calls exec*() to exit. The process
id returned by the getpid() function will be the
process id of the process that invoked the exec*()
function. In other cases, it will be the same process
id that WIN32 uses. To prevent that process id
from being used again by WIN32, a handle to the 
process is kept by the grandparent process. 


Even though we implemented fork() and the 
exec*() family of functions, our code rarely uses 
them. Because the CreateProcess() function 
doesn’t have an overlay flag, two processes need to 
be created in order to do both fork() and 
exec*(). libast provides a spawn*() family 
of functions that combines the functionality of 
fork()/exec*() on systems that don’t have the
spawn*() family. All functions in libast that 
create processes such as system() and popen()
are programmed with this interface. On most UNIX
systems, the spawn*() family is written using 
fork() or vfork() and exec*(). We 
implemented spawn*() in our posix.dll library
to call CreateProcess() directly. 


6.4 Signals 


Signals are handled by having each process run a 
thread that waits on an event. To send a signal to a 
process, the bit corresponding to the given signal 
number is set in the receiving process’s process 
block, and then its signal thread event is set. The 
signal thread then wakes up and looks for signals. It 
is important for the signal handler to be executed in 
the primary thread of the process, since the handler 
may contain a longjmp() out of the handler 
function. Prior to calling main(), an exception 
filter is added to the primary thread that checks for 
signals. The signal thread does this by suspending
the primary thread, raising an exception that will
activate the exception filter of the primary thread, and
then resuming the primary thread.
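
The bookkeeping on both sides of a signal can be sketched with three
bit masks. The sketch leaves out the event and the thread handling, and
its names (struct psig, sig_post(), sig_next()) are hypothetical. It
also assumes that an ignored signal is discarded when it is posted and
that the lowest-numbered signal is delivered first; a shared process
block would need interlocked updates rather than plain assignments.

```c
#include <stdint.h>

struct psig {                 /* loosely modeled on Psig_t */
    uint32_t pending;         /* signals posted but not yet delivered */
    uint32_t blocked;         /* signals held back by the signal mask */
    uint32_t ignored;         /* signals whose disposition is "ignore" */
};

#define SIGBIT(sig) ((uint32_t)1 << ((sig) - 1))

/*
 * Sender side: record signal sig (1..32) in the receiver's block.
 * Returns 1 if the receiver's signal thread should be woken, else 0.
 */
int sig_post(struct psig *ps, int sig)
{
    if (sig < 1 || sig > 32 || (ps->ignored & SIGBIT(sig)))
        return 0;
    ps->pending |= SIGBIT(sig);
    return (ps->blocked & SIGBIT(sig)) == 0;
}

/*
 * Receiver side: take the lowest-numbered pending, unblocked signal.
 * Returns the signal number, or 0 if nothing is deliverable.
 */
int sig_next(struct psig *ps)
{
    uint32_t ready = ps->pending & ~ps->blocked;
    int sig;

    for (sig = 1; sig <= 32; sig++) {
        if (ready & SIGBIT(sig)) {
            ps->pending &= ~SIGBIT(sig);
            return sig;
        }
    }
    return 0;
}
```

A blocked signal stays in the pending mask and is returned by
sig_next() once the block is lifted.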


6.5 Terminals 


The POSIX termios interface is implemented by 
creating two threads; one for processing keyboard 
input events, and the other for processing output 
events and escape sequences. These threads are 
connected to the read and write file descriptors of the 
process by pipes. The same architecture is used for 
socket based terminals and serial I/O lines. Initially, 
these threads run in the process that created the 
console and make it the controlling terminal. These 
threads service all processes that share the 
controlling terminal. New threads will be created if 
the process that owns the threads terminates and 
another process is sharing the console. When a 
process is created, these threads are suspended and 
the console handles are passed down to the child. 
This enables a native application to run with its 
standard input and output as console handles. If the 
application has been linked with the posix.dll,
then these threads are resumed before main() is
called so that UNIX style terminal processing takes 
place. The result is that UNIX processes will echo 
characters as they are typed and respond to special 
keys specified by stty, whereas native WIN32 
applications will only echo characters when they are 
read and will use Control-C as the interrupt 
character. 


6.6 Ids and Permissions 


Permissions for files are only available on Windows 
NT. Calls to get and set permissions return "not
implemented" errors on Windows 95. Creating a
Windows NT ACL that closely corresponds to UNIX
permissions isn’t very difficult. The ACL needs
three entries: one for owner, one for group, and one
that represents the group that contains all users.
Windows NT allows separate permissions to delete a
file and to change its security attributes. These
permissions are given to the owner of a file. The
UNIX umask() command sets the default ACL so
that native applications that are run by UWIN will
create files with UNIX type permissions.
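
The three-entry mapping can be sketched as follows. The rights
constants and the struct ace type are hypothetical stand-ins for WIN32
access masks and access control entries; only the shape of the mapping
is taken from the description above.

```c
#include <stddef.h>

/* Hypothetical access rights; real ACLs use WIN32 access masks. */
enum {
    R_READ = 1, R_WRITE = 2, R_EXEC = 4,
    R_DELETE = 8,          /* separate from write on Windows NT */
    R_CHANGE_PERM = 16     /* right to change the security attributes */
};

enum who { WHO_OWNER, WHO_GROUP, WHO_EVERYONE };

struct ace {
    enum who who;
    unsigned rights;
};

/* Convert one rwx triple (0..7) into the rights above. */
static unsigned rwx_to_rights(unsigned bits)
{
    unsigned r = 0;

    if (bits & 4) r |= R_READ;
    if (bits & 2) r |= R_WRITE;
    if (bits & 1) r |= R_EXEC;
    return r;
}

/* Build the three entries that approximate a UNIX mode such as 0644. */
void mode_to_acl(unsigned mode, struct ace acl[3])
{
    acl[0].who = WHO_OWNER;    /* the owner may also delete and chmod */
    acl[0].rights = rwx_to_rights((mode >> 6) & 7) | R_DELETE | R_CHANGE_PERM;
    acl[1].who = WHO_GROUP;
    acl[1].rights = rwx_to_rights((mode >> 3) & 7);
    acl[2].who = WHO_EVERYONE;
    acl[2].rights = rwx_to_rights(mode & 7);
}
```

The reverse direction is the hard one: an arbitrary ACL can name many
users and deny individual rights, which three rwx triples cannot
express.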


Mapping of subject identifiers to and from user and 
group ids is more complex. UWIN maintains a table 
of subject identifier prefixes, and constructs the user 
id and group id by a combination of the index in this 
table and the last component of the subject identifier. 
The number of subject identifier prefixes that are 
likely to be encountered on a given machine is much 
smaller than the number of accounts so that this table 
is easier to maintain. 
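
The prefix table can be sketched as follows. The text does not say how
the table index and the last component are combined, so the 16-bit
split used here is purely an assumption, as are the names sid_table and
sid_to_id() and the use of the textual S-1-... form of an identifier in
place of its binary form.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PREFIX 16               /* illustrative table size */
#define RID_BITS   16               /* assumed split between index and RID */

struct sid_table {
    int  count;
    char prefix[MAX_PREFIX][64];    /* e.g. "S-1-5-21-1-2-3" (made up) */
};

/*
 * Map a textual identifier such as "S-1-5-21-1-2-3-1001" to a numeric id:
 * high bits = index of its prefix in the table, low bits = last component.
 * Unknown prefixes are added to the table. Returns -1 if the identifier
 * is malformed, its last component is too large, or the table is full.
 */
long sid_to_id(struct sid_table *t, const char *sid)
{
    const char *dash = strrchr(sid, '-');
    char *end;
    unsigned long rid;
    size_t plen;
    int i;

    if (dash == NULL || dash == sid)
        return -1;
    rid = strtoul(dash + 1, &end, 10);
    if (end == dash + 1 || *end != '\0' || rid >= (1UL << RID_BITS))
        return -1;
    plen = (size_t)(dash - sid);
    if (plen >= sizeof(t->prefix[0]))
        return -1;
    for (i = 0; i < t->count; i++) {
        if (strlen(t->prefix[i]) == plen &&
            memcmp(t->prefix[i], sid, plen) == 0)
            break;
    }
    if (i == t->count) {                 /* first time this prefix is seen */
        if (t->count == MAX_PREFIX)
            return -1;
        memcpy(t->prefix[i], sid, plen);
        t->prefix[i][plen] = '\0';
        t->count++;
    }
    return ((long)i << RID_BITS) | (long)rid;
}
```

All accounts that share a prefix occupy one table slot, which is why
the table stays much smaller than the list of accounts.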


6.7 Special files and Links 


Special files such as fifos and symbolic links require 
stat() information that is not kept by the NT or
FAT file systems. Also, the file system does not 
store the setuid and setgid permission bits. 
With the NT file system, this extra information has 
been stored by using a poorly documented feature 
called multiple data streams that allows a file to have 
multiple individually named parts. A separate data 
stream is created to hold additional information about 
the file. The SYSTEM attribute is put on any file or 
directory that has an additional data stream so that 
they can be identified quickly with minimal overhead 
during pathname mapping. 


Using multiple data streams requires the NT file 
system. On other file systems, fifos and symbolic 
links are implemented by storing the information in 
the file itself. The setuid and setgid functionality is not
supported on these file systems. 


UWIN treats Windows 95 and Windows NT 4.0 
short cuts as if they were symbolic links. However, 
these links can be created with any of the UWIN 
interfaces. This was done by reverse engineering the 
format of a short cut file and finding where the 
pathname of the file that it referred to was stored. 


Fifos are implemented by using WIN32 named pipes. 
A name is selected based on the creation date of the 
fifo file. Only the first reader and the first writer on 
the fifo create and connect to the named pipe. All 
other instances duplicate the handle of either the 
reader or the writer. This way all writers to a fifo
use the same handle as required by fifo semantics.


A POSIX subsystem command is also invoked to 
create hard links since there is no WIN32 API 
function to do this. Hard links fail for files in the 
FAT file system. 


6.8 Sockets 


Sockets are implemented as a layer on top of 
WINSOCK, the Microsoft API for BSD sockets. 
Most functions were straightforward to implement.
The select() function proved more difficult than
we had anticipated because socket handles could not
be used for synchronization, and because the
Microsoft select() call only worked with socket
handles. The posix.dll select() function
allows different types of file descriptors to be waited
for.
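As a rough illustration of what waiting on mixed descriptor types looks like from the UNIX side, the following Python sketch waits on a pipe and a socket in a single select() call. It relies on the host system's own select() and is not UWIN code; the helper names wait_readable and demo are ours.

```python
import os
import select
import socket


def wait_readable(descriptors, timeout=1.0):
    """Return the descriptors that have data to read, in the order given."""
    ready, _, _ = select.select(descriptors, [], [], timeout)
    return [fd for fd in descriptors if fd in ready]


def demo():
    """Make a pipe and a socket readable, then wait on both at once."""
    pipe_r, pipe_w = os.pipe()
    sock_a, sock_b = socket.socketpair()
    try:
        os.write(pipe_w, b"p")      # make the pipe readable
        sock_b.sendall(b"s")        # make the socket readable
        ready = wait_readable([pipe_r, sock_a.fileno()])
        return ready == [pipe_r, sock_a.fileno()]
    finally:
        os.close(pipe_r)
        os.close(pipe_w)
        sock_a.close()
        sock_b.close()
```

On a UNIX host demo() should report that both descriptors became ready. A select() that accepts only socket handles cannot be given the pipe at all, which is the restriction described above.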


Our first implementation of select() created a
separate thread that used the Microsoft select()
to wait for socket handles, and created an event for
the main thread to add to the list of handles to wait
for. Our second implementation used a library
routine to convert input/output events on sockets to
Windows messages and then waited for both
Windows messages and handle events simultaneously.
This method had the added advantages that it was
possible to implement SIGIO and that it was easy to
add a pseudo file device named /dev/windows
that could be used to listen for Windows messages.
Adding this pseudo device made it possible to use
the UNIX implementation of tcl to port tksh [17]
applications to Windows NT.


6.9 Invocation 


When UWIN invokes a process, it does not know 
whether the process is a UWIN process or a native 
process. It modifies the PATH variable so that it 
uses the ; separated DOS format. It also passes open 
files in the same manner that the Microsoft C library 
does so that programs that are compiled with this 
library should correctly inherit open files from 
UWIN programs. The initialization function also 
sees whether a security token has been placed in its 
address space by the UMS server, and if so, it 
impersonates this token. 
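The PATH rewriting mentioned above can be pictured with a short Python sketch. Only the change of separator comes from the text; the function name to_dos_path_list and the map_directory hook (a stand-in for whatever per-directory pathname mapping is applied) are assumptions made for illustration.

```python
def to_dos_path_list(unix_path, map_directory=lambda d: d):
    """Turn a ':' separated search path into a ';' separated one.

    map_directory is a placeholder for a per-directory pathname mapping;
    the default leaves each entry unchanged.
    """
    return ";".join(map_directory(d) for d in unix_path.split(":"))
```

For example, to_dos_path_list("/usr/bin:/bin") yields "/usr/bin;/bin". Empty entries are preserved so that their position in the search path is not lost.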


The POSIX library has an initialization routine that
sets up file descriptors and assigns the controlling
terminal, starting the terminal emulation threads as
required. The posix.lib library also supplies a
WinMain() function that is called when the
program begins. This function initializes the stdin,
stdout, and stderr functions and then calls a
posix.dll function, passing the address of another
posix.lib function that actually invokes main().
The posix.dll function starts up the signal
thread and sets the exception filter for signal
processing as described above. The reason for this
complexity is so that UNIX programs will start with
the correct environment, and so that argv[0] will
have UNIX syntax without the trailing .exe, since
many programs use argv[0]. Much of the
complexity occurs inside the posix.dll part
because programs do not require recompilation when
changes are added there.
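A minimal sketch of the argv[0] adjustment, written in Python for brevity, is given below. It assumes that "UNIX syntax" amounts to forward slashes and no .exe suffix; drive letters and any further pathname mapping are deliberately left out, and the function name is hypothetical.

```python
def unix_style_argv0(native_name):
    """Drop a trailing .exe (any case) and use '/' as the separator."""
    name = native_name.replace("\\", "/")
    if name.lower().endswith(".exe"):
        name = name[:-len(".exe")]
    return name
```

A program that inspects argv[0] to decide how to behave then sees "bin/ls" rather than "bin\ls.exe".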


7. CURRENT STATUS 


At the time of this writing, most interfaces required 
by the X/Open Release 4 standard have been written 
and work as described in the standard. The X/Open 
standard requires full ANSI C functionality as well. 
In addition, interfaces for the curses library, the
sockets library, and the dynamic linking library are also
working.


A C/C++ compiler wrapper has been written that 
calls either the Microsoft Visual C/C++ 2.x or 4.x 
compiler. This compiler supports the most 
commonly used UNIX conventions and implicitly 
sets default include files and libraries. In addition it 
has an added hook for specifying native compiler 
and linker options. Applications compiled with our 
cc command can be debugged with native debuggers 
such as the Visual C/C++ debugger. Several auto 
configuration programs use the output of the C 
preprocessor to probe the features of the system. 
The output format of the Microsoft C compiler 
caused some of the configuration programs to fail. 
To overcome this, a filter is inserted when running 
the compiler to generate preprocessor output so that 
existing configuration programs work. Our compiler 
wrapper can be invoked as cc for ANSI-C
compilation, as CC for C++ compilation, and as pcc
to build POSIX subsystem applications.


Our compiler wrapper follows the normal UNIX 
defaults for suffixes rather than using the Microsoft 
conventions; .o’s rather than .obj’s. The .exe 
suffix is not required for Windows NT since it uses 
permission bits to distinguish executables. However, 
since we also want binaries to run on Windows 95, 
the .exe suffix is added to the name of the output 
file if no suffix is supplied when the compiler is 
invoked as cc or CC. 
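The suffix rule can be stated in a few lines of Python. This is a sketch of the rule as described, not the wrapper's actual code, and the function name is ours.

```python
import os


def output_file_name(requested):
    """Append .exe only when the requested output name has no suffix."""
    _, suffix = os.path.splitext(requested)
    return requested if suffix else requested + ".exe"
```

So an output name of hello produces hello.exe, while a.out and hello.exe are left alone.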


The latest version of ksh, ksh-93, was ported.
The implementation supports all features of ksh-93
including job control and dynamic linking of built-in
commands at run time. While no changes to the
code should have been necessary, changes were
made to ksh specifically for NT. The hostname
mapping attribute, typeset -H, which has no
effect on UNIX systems, was modified to call the
posix.dll function that returns the WIN32
pathname corresponding to a given UNIX pathname.
The ability to do case insensitive matching for file
expansion was also added. A compile time option to
allow <cr><nl> in place of <nl> was added to the
shell grammar to avoid the overhead of text file
processing.
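Case insensitive file expansion can be approximated as follows. The sketch uses Python's fnmatch module rather than the shell's own pattern matcher, and simply folds both the pattern and the candidate names to lower case; the function name is an assumption.

```python
import fnmatch


def expand_ignoring_case(pattern, names):
    """Return the names matching a shell pattern, ignoring letter case."""
    folded = pattern.lower()
    return sorted(n for n in names if fnmatch.fnmatchcase(n.lower(), folded))
```

With this rule the pattern *.TXT selects both Readme.txt and NOTES.TXT, as a user of a case-preserving but case-insensitive file system would expect.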


About 150 UNIX tools have been ported to Windows
NT; the vast majority required no changes. Common
software development tools such as yacc, lex, 
make and nmake have also been ported. Most of 
the utilities are versions that we have written at 
AT&T over the last ten years and are easily portable 
to all UNIX platforms. Other utilities, such as 
make, bc, and gzip we compiled from the GNU 
source using autoconfig to generate headers and 
makefiles. The yacc and less utilities and the 
new vi program were ported from freely available 
BSD source code. In most cases, no changes were 
made to the original source code. 


The X Windows code has two parts, the client and 
the server. The server had already been ported to 
Windows NT and Windows 95 by commercial 
vendors and there was no need to build a UWIN
version of it. In addition, the server is often
running on a UNIX host. The most difficult part of 
porting the X Windows client code was the fact that 
it had #ifdefs for WIN32 that selected native 
WIN32 calls, bypassing the UWIN calls. Once this 
was straightened out, the compilation was 
straightforward. 


8. PERFORMANCE 


There are two issues to consider with respect to 
performance. The first is how UWIN performs
compared to using the WIN32 API and/or Microsoft 
C library directly. The comparison of UWIN to 
native performance measures one of the costs 
involved in using UWIN as opposed to using an 
alternative strategy such as rewriting to the WIN32 
API. 


The second is how Windows NT performs relative to
UNIX systems. The performance of UNIX
operating systems on the Pentium processor was
investigated by Kevin Lai and Mary Baker [18], who
showed that, except for networking, the Linux
system, URL http://www.linux.org, performs
the best of the UNIX systems. The comparison to a
UNIX system may be important in deciding whether
to choose a UNIX platform or a Windows NT
platform, although other considerations often dictate
this choice.


All performance comparisons were made on a
Micron computer with a 133 MHz Pentium processor
and 32 Mbytes of memory. The WIN32
measurements were made using the NTFS file system
on the Windows NT 4.0 operating system. The UNIX
measurements were made on Linux version 2.0.18 on
the same hardware.


There are four sets of tests. The first set of tests, 
shown in Table 4, are the same ones used in the 
1991 Usenix Sfio paper. The implementation of 
stdio under UWIN uses Sfio rather than the 
Microsoft implementation since the Microsoft 
implementation makes WIN32 calls directly rather 
than going through the read()/write() UNIX 
interface. The Linux tests were run with an Sfio 
implementation rather than using the native 
implementation to make it easier to compare results. 
The results show that applications that are dominated 
by calls to stdio or Sfio are likely to perform at
least as well when run under UWIN. 


The second set of tests, summarized in Table 5,
measures the performance of certain system calls:
for example the time to open and close files, to read
and write data, the time to create and delete files, 
and the time to open and read directories. The tests 
are as follows: 


1. Open and close a file 10000 times.
2. Create and delete a file 10000 times.
3. Open and read a directory containing two files
10000 times.
4. Open and read a directory containing 500 files
10000 times.
5. Run system("/bin/echo") 100 times.


The tests were run five times and the middle three
times were averaged. The first four tests report the
sum of user+system time. The last test uses only
elapsed time because of the difficulty of obtaining
accumulated times for processes using the
CreateProcess() call.
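The averaging rule can be written down directly. The sketch below assumes that "the middle three times" means discarding the fastest and the slowest of the five runs; the function name is ours.

```python
def middle_three_average(times):
    """Average five timings after dropping the smallest and the largest."""
    if len(times) != 5:
        raise ValueError("expected exactly five timings")
    ordered = sorted(times)
    return sum(ordered[1:4]) / 3
```

Dropping the extremes keeps a single unusually slow or fast run, for example one disturbed by unrelated system activity, from skewing the reported figure.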


These tests show that creating and deleting files in 
UWIN is much slower than with Linux. Much of 
the time difference is due to the way UWIN deletes 
files to provide UNIX semantics. With UNIX it is 
possible to delete a file while it is open, and then
create a file of the same name. Clearly a more
efficient mechanism for doing this is needed. 
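The UNIX behavior that has to be reproduced is easy to demonstrate on a UNIX host. The Python sketch below shows the semantics only and says nothing about how UWIN achieves them; the function and file names are illustrative, and the same code is expected to fail on a native Windows file system, where an open file normally cannot be removed.

```python
import os
import tempfile


def delete_while_open(directory):
    """Unlink an open file, recreate the name, and read both versions."""
    path = os.path.join(directory, "data.txt")
    with open(path, "w") as f:
        f.write("old")
    old = open(path)            # keep the original file open
    try:
        os.unlink(path)         # the name goes away, the open file does not
        with open(path, "w") as f:
            f.write("new")      # a new file may reuse the same name
        old_contents = old.read()
    finally:
        old.close()
    with open(path) as f:
        new_contents = f.read()
    return old_contents, new_contents
```

Called on a scratch directory, the function returns the contents of both files, showing that the unlinked file stays readable through its open descriptor while the name already refers to a new file.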


While reading small directories is slower than with 
Linux, the tests show that large directories are 
actually faster with NT. The system() test shows 
that Linux is quite a bit faster in launching processes 
than NT. The NT native test, unlike the UWIN test, 
executes /bin/echo directly without running the 
shell so that the results are better than they might 
otherwise be. 


The third set of benchmarks is called the Modified 
Andrew Benchmarks [19]. These benchmarks measure
the elapsed time to perform a set of tasks such as 
copying files, doing recursive walks, and compiling 
code. The original set of Andrew Benchmarks used 
the native C compiler; the Modified Andrew 
Benchmarks come with source code for a stripped 
down version of gcc so that the differences in 
compilers can be eliminated. It took little effort to
modify the makefiles and build the compiler. The 
time to run the Modified Andrew Benchmark was 
110 seconds under UWIN. The time was a mere 18 
seconds under Linux. This benchmark shows the 
effect of the slower file and process creation times. 
The UWIN times could be improved by using the 
spawn family of functions in place of fork/exec 
to execute the components of the compiler as the 
UWIN cc command does. In both cases the test
failed to complete near the final step, creating the
archive, because the native archiver was unable to
handle the format produced by the generated gcc
compiler.


The final set of benchmarks that we tried to run was
the benchmark suite named lmbench, written by
Larry McVoy and presented at the 1996 USENIX
conference [20]. We were able to run only a portion
of these benchmarks because many of the
benchmarks require the rpc library which hasn’t
been ported to UWIN yet. In addition, we omitted
tests that were covered by earlier benchmarks. The
results of this benchmark are presented in Table 6.
While it confirms that the I/O bandwidth under
UWIN is quite good, it also shows that other costs,
such as pipe latency, are quite large. We did not
investigate this discrepancy.


9. FUTURE 


We are in the process of running X/OPEN
conformance tests on UWIN to see how close we
have come to being compliant. In addition, we are
trying to decide how to port the n-dimensional file
system, n-DFS [21], to Windows NT. n-DFS provides
a mechanism to add file system services such as
viewpathing and versioning. The difficulty in
porting n-DFS is that it must also capture native
WIN32 API calls to provide a transparent interface.


The current version of UWIN does not handle many
of the internationalization issues well. The current
implementation has been compiled for ASCII rather
than UNICODE. We plan to use UTF8 encoding of
UNICODE for the system call interface, and to 
convert to UNICODE on the NT file system. This 
way we do not need to build separate binaries for 
UNICODE. 
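The planned conversion can be sketched as a pair of functions. We assume here that "UNICODE on the NT file system" means 16-bit little-endian code units (UTF-16LE); the function names are hypothetical and the sketch is in Python rather than in the library's own C.

```python
def to_nt_name(syscall_name):
    """Convert a UTF-8 byte string from the system call interface to UTF-16LE."""
    return syscall_name.decode("utf-8").encode("utf-16-le")


def from_nt_name(nt_name):
    """Convert a UTF-16LE file system name back to UTF-8 bytes."""
    return nt_name.decode("utf-16-le").encode("utf-8")
```

Because ASCII text is unchanged under UTF-8, existing programs that pass ASCII pathnames keep working without being rebuilt, which is the point made above about not needing separate binaries.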


The current version of UWIN does not support files
larger than two gigabytes because off_t
is stored as a 32 bit integer. The underlying NTFS
file system supports 64 bit file offsets. Since the 
next version of Sfio supports 64 bit file offsets, we 
plan to support large files in a future version of 
UWIN. 
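The two gigabyte figure follows from the width of the offset type. The short Python check below assumes that the 32 bit off_t is signed, as is conventional, and uses the standard struct module to test whether an offset fits; the names are ours.

```python
import struct

MAX_OFFSET_32 = 2**31 - 1   # largest value of a signed 32 bit integer


def fits_in_32_bit_offset(offset):
    """Report whether an offset can be stored in a signed 32 bit integer."""
    try:
        struct.pack("<i", offset)
        return True
    except struct.error:
        return False
```

The last representable offset is 2147483647, one byte short of two gigabytes, so a file of exactly two gigabytes already has positions that cannot be expressed.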


Another issue worth investigating is whether it is 
possible to run Linux binaries under UWIN. This 
would only make sense for dynamically linked 
programs. 


Finally there are some WIN32 interfaces that could 
be handled through the file system interface such as 
the Windows NT registry and the clipboard. 


10. CONCLUSIONS 


There appear to be few if any technical reasons to 
move from UNIX to Windows NT. The 
performance of Linux exceeds that of NT 4.0 and 
Linux appears to be more reliable. On three
occasions NT 4.0 crashed when running the 
performance tests. There were no crashes with 
Linux. However, if you want to or need to move an 
application to Windows 95 or Windows NT, we 
believe the POSIX library we developed to be 
superior to any of the existing commercial libraries. 
While in many cases the performance loss using 
UWIN is minimal, the performance tests show that 
UWIN needs improvement. 


The code for the posix.dll library is fairly small,
about 10K lines including the terminal emulator. 
This library runs in the WIN32 subsystem using the 
WIN32 API and runs under Windows 95 as well. 


We hope to be able to make version 1.1 of UWIN
available in binary form on the internet. Check the
internet web site
http://www.research.att.com/sw/tools for
details. We hope that this will encourage
contributions of applications that have been built
with UWIN.


Test       Size      Kb/s
fwrite     10000K   15923
fread      10000K   36231
revrd      10000K   32679
fw757      10000K   13227
fr757      10000K   24154
rev757     10000K   10989
copy&rw    10000K    9940
seek+rw    2000S    34566
putc       5000K     5720
getc       5000K    10245
fputs      50000L   97656
fgets      50000L  126262
revgets    50000L   72254
fprintf    50000L    8567
fscanf     50000L   10888

[TABLE 4: the original header listed seconds and Kb/s columns; only the single Kb/s column above survives, and the system it belongs to cannot be determined.]


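Test            Count
open/close      10000
create/delete   10000
readdir-2       10000
readdir-500     10000
system            100

[The timing columns of this table are not recoverable; the only surviving values are 2.88, 0.45, 22222, and 2.80, whose positions cannot be determined.]
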
TABLE 5. Syscall timings


                         Unit     UWIN
Null syscall             us          4
Pipe latency             ms        295
Pipe bandwidth           MB/sec  23.33
File write bandwidth     KB/sec   1995
File read bandwidth      MB/sec  33.33
Mmap read bandwidth      MB/sec  75.00
Memory read bandwidth    MB/sec  90.91
Memory write bandwidth   MB/sec  76.92

TABLE 6. Selected lmbench results


ABSTRACT

Protected Shared Libraries, or PSLs, are a new type of support for modularity that form a basis 
for building flexible library-based operating system services. PSLs extend the familiar notion of 
shared libraries with protected state and data sharing across protection boundaries. Protected state 
information allows PSLs to be used to implement sensitive operating system services. Sharing of 
data across protection boundaries yields significant performance benefits. These features make 
PSLs a viable basis on which a complete operating system can be built largely as a set of dynami- 
cally loadable libraries without compromising protection or sacrificing performance. PSLs also 
allow highly flexible implementations of new functionality to be added to current commercial
operating systems. A prototype PSL implementation has been built into AIX 3.2.5 and early
performance results are encouraging.


1. Introduction

Software flexibility relies on modularity, the ability 
to modify or replace individual software components 
easily. Modularity in turn relies on not only the software’s
internal structure, but also on the degree to which
modularity is supported by the underlying operating 
system and the efficacy of that support. It is not surpris- 
ing, therefore, that traditional monolithic systems which 
lack comprehensive support for modularity are charac- 
teristically inflexible and difficult to develop and main- 
tain. Production of highly adaptable and manageable 
systems relies on the development of modularity support 
which is flexible, efficient, and easy to use. 

Attempts to produce modular operating systems 
have generally followed one of two approaches. The 
first is to separate an existing operating system kernel 
into a microkernel that provides a basic set of funda- 
mental constructs and one or more user-level server 
tasks which run on top of the microkernel and provide 
operating system services [Black et al. 92] [Rosier et al.
92]. This approach has been applied to a number of
commercial operating systems [Batlivala et al. 92]
[Borgendale et al. 94] [Golub et al. 90] [Golub et al. 93]
[Malan et al. 90] [Phelan et al. 93] [Weicek et al. 93].
The second approach is to design an entirely new oper- 
ating system emphasizing flexibility using object-ori- 
ented technology which generally includes language 
support. The second approach has primarily been rele- 
gated to academic and research environments [Bershad 
et al. 95] [Campbell et al. 93] [Yokote 92].

Neither of these approaches features adequate flexibility,
efficiency, and ease of use. By itself the microkernel
approach conveys separation of an operating system
kernel along only a single line, the kernel-user boundary.
Finer-grained decomposition of both the kernel-level
and user-level portions remains an issue. Also,
decomposition of the user-level portion into multiple
user-level server tasks may be inefficient due to overhead
associated with task-based protection [Condict et
al. 93] [Ford & Lepreau 94] [Lepreau et al. 93] [Maeda
& Bershad 93]. The language based object-oriented
approach is generally applicable only to new systems.

An alternate approach to modularity that provides
sufficient flexibility and efficiency and is easily applicable
to operating systems is needed. Protected Shared
Libraries (PSLs) are just such an approach. PSLs extend 
the familiar notion of shared libraries by adding support 
for protected state and allowing data to be shared across 
protection boundaries. Protected Shared Libraries con- 
sist of two separate mechanisms: Protected Libraries 
and Context Specific Libraries. 

Protected Libraries associate access to specific state 
information with each library entry point. Entrance to a 
library routine, which can be enacted only via a defined 
entry point, conveys access to data associated with the 
routine; access is revoked when the routine is exited. 
Similarly, access to a process’s data segment is revoked 
upon entry to a Protected Library routine and restored 
upon returning from it. Thus, library and client data are
protected from each other’s code. 
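The entry and exit rule can be pictured with a toy model. The Python sketch below is not the PSL implementation, which enforces access through the memory protection hardware; it only mimics the bookkeeping with a flag on each data region, and every name in it is invented for illustration.

```python
class AccessRevoked(Exception):
    """Raised when code touches a region it currently has no access to."""


class Region:
    """A block of data with a switchable access flag (toy model)."""

    def __init__(self, accessible, **values):
        self.accessible = accessible
        self._values = dict(values)

    def read(self, key):
        if not self.accessible:
            raise AccessRevoked(key)
        return self._values[key]

    def write(self, key, value):
        if not self.accessible:
            raise AccessRevoked(key)
        self._values[key] = value


def call_entry_point(routine, library_data, client_data, *args):
    """Enter a library routine: grant library data, revoke client data."""
    client_was_accessible = client_data.accessible
    library_data.accessible = True
    client_data.accessible = False
    try:
        return routine(library_data, *args)
    finally:
        # Leaving the routine reverses both changes.
        library_data.accessible = False
        client_data.accessible = client_was_accessible


def add_to_counter(library_data, amount):
    """Example library routine that updates protected state."""
    library_data.write("counter", library_data.read("counter") + amount)
    return library_data.read("counter")
```

In this model a client can change the counter only by going through call_entry_point; a direct read of the library region outside a call raises AccessRevoked, and so does any attempt by the routine to read client data while it runs.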

Context Specific Libraries, or CSLs, share a single 
copy of code at a single address as traditional shared 
libraries do, but offer significantly increased flexibility 
regarding shared data. They may be seen as a mecha- 
nism for encapsulating information that needs to be 
shared between protection domains. CSLs allow data to
be shared between clients and a service in different
ways. A CSL may share a single copy of data between 
multiple clients, such that all clients see the same data at 
identical locations. Alternatively, a CSL may share data 
between a client and a service at a single location such 
that the actual contents of the shared region is associated 
with the calling (client) or the called (service) region. 
Later sections present details of the various forms of 
sharing supported. 

The PSL infrastructure is based on two distinct 
hypotheses. First, given the benefits of shared libraries, 
it may be easier to compose user-level system services 
as sets of cooperating shared libraries rather than as sep- 
arate processes. Associating protection with shared 
libraries would allow modular library-based system ser- 
vices to be constructed to replace traditional servers and 
perhaps even some privileged mode components such as 
loadable virtual file systems. The ease with which 
shared libraries can be loaded, unloaded and dynami- 
cally relocated provides flexibility not easily attained 
with cooperating processes. Second, sharing between 
cooperating entities has to be an intrinsic part of the pro- 
gramming model and not retrofitted through facilities 
such as mmap. CSLs allow programmers to share infor- 
mation in libraries and control the exact nature of shar- 
ing. This opens up possibilities for easily creating UNIX 
u_block [Goodheart & Cox 92] implementations, shar- 
ing I/O buffers across protection domains [Khalidi & 
Nelson 93] and sharing closures and objects across pro- 
tection boundaries [Banerji et al. 94a]. 

The remainder of this paper proceeds as follows. 
The next section presents the motivation for Protected 
Shared Libraries, both to improve modularity and to 
facilitate sharing. After that, PSL semantics are 
described. Implementation issues are discussed in Sec- 
tion 4. Performance results from a prototype PSL imple- 
mentation are presented in Section 5, and finally, 
Section 6 presents a brief discussion of the contributions 
made by Protected Shared Libraries. 


2. Design Motivation 

Protected Shared Libraries are motivated by two 
factors. First, passive protection domains, particularly 
shared libraries, provide an excellent basis for software 
modularity. Second, efficiency requires cross-domain 
interactions use shared data. Discussion of these obser- 
vations continues in the next subsection. Following that, 
the overall PSL design approach is described. 


2.1 Shared Libraries as Protection Domains

Enforced protection boundaries have been found to
be a very effective software structuring tool especially
for large systems [Nelson 91] [Bogle 94] [Khalidi &
Nelson 93] [Chase 94]. Protection can be enforced
through a variety of means including separate address 
spaces [Acetta et al. 86], language support [Nelson 91], 
and post-processing of binary code [Wahbe 93]. Each of 
these approaches has been used to increase modularity 
and security and to facilitate debugging of large soft- 
ware systems. Protection has also been used to ease 
modification or replacement of software components 
[Pu 95] [Khalidi & Nelson 93] [Orr 92].

Most work regarding enforcement of protection 
boundaries in current operating system software, both 
user-level and kernel-level, has been focussed on 
improving the efficiency of cross-domain invocations. 
Invocation times have been significantly reduced by 
handoff scheduling [Black 89] and thread migration 
[Bershad 90] [Lepreau et al. 94] [Hamilton & Kougiouris
93]. These protection domains, however, have
typically been associated with processes. Counter exam- 
ples exist [Organick 72] [Scott et al. 90] [Wulf et al. 81], 
but have generally been limited to research efforts 
encompassing entirely new operating systems. The only 
system known to use passive protection domains effectively
in a commercial operating system is Mach 4.0
[Lepreau et al. 94], but even that implementation is
closely tied to the notion of processes. Our focus on 
using passive abstractions to represent protection 
domains is based both on the experience of others 
[Carter et al. 93] as well as our own [Banerji et al. 94a]
that clearly demonstrate the advantages of passive mod- 
ularity. 

Use of passive entities, as opposed to active pro- 
cesses, to represent protection domains has usually 
taken one of two forms, objects as in Clouds and Psyche 
or shared libraries as in Multics. A strong case for the
support of passive objects as the basic structuring mech- 
anism has been made [Ford 93]. The most important 
advantages of passive protection domains are their abil- 
ity to better represent the common case of synchronous 
communication, their documented ability to support 
optimized implementations [Druschel 92] [Chase 94] 
[Carter et al. 93], and the ease with which they can be
managed in user-level client code. The last advantage is 
especially important in making shared libraries a good 
vehicle for passive protection domains. 


2.1.1 Shared Library Limitations 

Most commercial operating systems support shared 
libraries in one form or another, but semantics vary from 
system to system. On most UNIX systems, library code 
is shared, but each client process accessing a shared
library gets its own copy of library data. This copy gets
mapped into the process’ private data segment and is, 
therefore, equally accessible by client and library code. 
Some systems such as OS/2 [Deitel & Kogan 92] and 
Windows [King 94] allow shared or dynamically linked 
libraries (DLLs) to contain shared data as well as code, 
but offer little or no protection. Each client task access- 
ing a DLL has equal access to the DLL’s data. Malicious 
or errant clients can, therefore, corrupt shared data and 
adversely impact other clients. 

What is desired then is shared library support that 
features both per-client data as in UNIX, and global data 
as in OS/2 and Windows, but with protection. Specifi- 
cally, we desire protection of library data, including per- 
client data, from client code, and client data from library 
code. 


2.2 Cross-Domain Sharing 

Increasingly, system software complexity is being 
addressed through use of modular protection domains. 
Cross-domain interactions usually take the form of fast 
RPC mechanisms that circumvent much of the traditional
in-kernel RPC code-path [Bershad 90] [Hamilton
& Kougiouris 93] [Condict et al. 93]. Efficiency of fast
RPC mechanisms has been improved through use of 
shared message buffers [Bershad 90]. Some optimized 
implementations, such as the Fbufs approach [Druschel 
& Peterson 93], improve throughput by two orders of 
magnitude. Efficiency concerns, therefore, make a com- 
pelling case for sharing. Sharing is also indicated by 
structural considerations. 

Cross-domain sharing has been used to improve 
structure in various parallel programming models [Scott 
et al. 90], and to support persistent databases [Bogle 94] 
[Chase 94] and shared object frameworks [Campbell et 
al. 93] [Banerji et al. 94]. Sharing enables cooperation
between domains with limited trust [Chase 94]. Thus, 
sharing can be used to support a variety of interactions 
including producer-consumer, non-intrusive monitoring, 
asynchronous service providers [Bogle 94], shared 
pipes, stateless servers with client maintained state, and 
shared objects exported by servers [IBM 93]. 

There is a third reason for sharing across protection 
domains. Most implementations of cross-domain object 
interactions include a fair amount of overhead for the 
locally distributed case [IBM 93] [Janssen 95]. Thus, 
sharing object instances [Banerji et al. 94] and passing 
enclosures between protection domains on the same 
machine are usually inefficient. Most of these problems 
may be solved by judicious sharing of data and 
addresses between interacting protection domains. Such 
techniques can drastically reduce the cost of object
interaction between protection domains on the same
machine.

Capturing this rich yet diverse set of structuring 
possibilities in a uniform abstraction requires careful
design. Adequate programming support is needed so the 
benefits of sharing may be fully and easily exploited. 


2.2.1 Programming Support for Sharing

Two attributes make shared information attractive
to programmers.

• First, the ability to share pointer-rich data across
domains is attractive. This ability has been found to
be useful for persistent stores [Chase 94], shared
C++ objects [Banerji et al. 94], distributed shared
data and system software. The obvious argument 
against uniform sharing is the need to reserve por- 
tions of an address space. This concern is decreas- 
ing with the increasing popularity of large 
effective address spaces such as the 52-bit global 
address space and 64-bit non-segmented address 
space in the POWER [Weiss 94] and Alpha archi- 
tectures respectively. The advantages in efficiency 
of avoiding pointer transformations in various 
applications and system software, as well as pro- 
grammer convenience are significant. 

• Second, the ability to treat shared data through
symbolic names that maintain meanings across
domains is attractive. A good example of this is the
ability to call the shared “libc” version of malloc
uniformly from any process that links in the shared
libc library. This facility can easily be extended to
shared data, by involving the linker or loader in the
manipulation of shared information. This approach
can be seen in the shared libraries of systems as
diverse as Multics, OS/2 and Hemlock [Garrett et
al. 93].

Clearly, with a little system support, uniform
addressing and naming can be integrated into relocatable
object modules, as has been done with shared
libraries in Multics, Hemlock and OS/2. In current 
implementations, however, the available sharing mecha- 
nisms provide limited flexibility. 

Context Specific Libraries are motivated by the 
observation that with a few system extensions, different 
modes of sharing, along with uniform addressing and 
naming, can easily be integrated with the common 
notion of shared libraries. Although other efforts have 
provided improved forms of sharing, few have been 
integrated with commercial operating systems. 


3. Protected Shared Library Semantics

Protected Shared Libraries extend traditional shared
libraries in two ways, with protected state data and
cross-domain data sharing. Protected Libraries protect
library and client data from each other’s code. Context 
Specific Libraries allow global, client-specific and 
domain-specific data to be shared across protection 
domains. The semantics of each of these mechanisms is 
now described in detail. 


3.1 The Mechanisms 

Protected Libraries improve modularity and secu- 
rity and facilitate debugging of large software systems 
by enforcing protection domains between client and 
library code. Previously, other approaches to protection 
based on active entities such as processes have been 
used to improve modularity. Protected libraries investi- 
gate the alternative of using dynamically loadable pas- 
sive shared libraries to enforce protection. This idea is 
based on efforts such as Psyche [Scott et al. 90] and
Multics [Organick 72] both of which supported pro- 
tected dynamically loadable object modules. 

Context Specific Libraries represent modules of 
code and data that may be shared in various forms 
between different protection domains. They represent a 
communication channel between protection domains 
and thus augment traditional RPC mechanisms. CSLs 
extend the notion of cross-domain data and address 
sharing as found in Fbufs [Druschel & Peterson 93], the 
zero-copy I/O framework and most implementations of
the UNIX u-block [Leffler et al. 89]. Together, these 
mechanisms form a coherent set of structuring and com- 
munication abstractions which support construction of 
system software at user-level. 


3.2 Protected Libraries 


Protection domain semantics are determined prima- 
rily by resource management. Traditional protection 
domains such as processes act as containers of resources 
such as memory, threads, file handles, and semaphores. 
With process-based protection, resource management 
during control transfer from one domain to another is 
fairly simple. Resources belonging to the currently run- 
ning process are accessible. Resource management is 
more complex with library-based protection because 
resources are not typically associated with libraries. 

Resources may be categorized into memory 
resources, such as client and service data, and non- 
memory resources, such as file handles and semaphores. 
A programmer using protected libraries typically need 
be concerned only with the handling of memory 
resources. Handling of non-memory resources is more 
complicated and usually only relevant to advanced pro- 
grammers and library authors. Prior to describing 
resource management issues in detail, we define several 
terms. 


3.2.1 Definitions 


Two threads executing concurrently within a pro- 
tection domain always see the same set of memory 
resources. A UNIX process constitutes a primary or root 
protection domain in which execution is initiated. A 
Protected Library is viewed as a secondary protection 
domain which is always associated with one or more 
primary domains. Execution in a secondary domain is 
always initiated by a thread entering it from another pri- 
mary or secondary domain. 

A thread is the primary unit of execution and can 
traverse protection domains. As in most multi-threaded 
process models, each thread owns a small set of per- 
thread resources such as scheduling and accounting 
information. These resources, usually encapsulated in a 
Shuttle [Hamilton & Kougiouris 93], belong to the 
thread and remain associated with the thread as it 
traverses protection domains. Most resources a thread 
accesses belong to the domain in which it is currently 
executing. All threads within a domain have equal 
access to the domain’s resources. Threads could con- 
ceivably be created in either primary or secondary 
domains, but thread creation is currently restricted to 
primary domains. 

If Protected Library calls were the only form of 
cross-domain interaction, thread movement would be 
restricted to one primary domain and its associated sec- 
ondary domains. Alternate mechanisms such as thread 
migration would allow a thread to move between multi- 
ple primary domains. The following discussion is sim- 
plified by limiting cross-domain control transfers to 
Protected Library invocations and eliminating consider- 
ation of other forms of IPC. 


3.2.2 Memory Resources 

A Protected Library is an enforced unit of modular- 
ity which contains code and data. Any stateless or state- 
full service that requires protection from either multiple 
users or from other software components can be imple- 
mented in a Protected Library. Protected Libraries can 
be viewed as regular shared libraries that export pro- 
tected entry points. Data associated with a Protected 
Library is accessible only once the library has been 
entered via a defined entry point. To summarize, a Pro- 
tected Library is a passive protection domain that can be 
entered only through defined entry points. 

Figure 1 depicts a Protected Library that does not 
use any Context Specific Libraries. A client process can- 
not access any of the service data until it calls a defined 
entry point. A thread starts executing in the client- 
domain where it can access only the process’s own pri- 
vate code and data. This includes code and data defined 
in the client program and in any ordinary libraries linked 
with it. Upon calling a defined service entry point, the 
thread enters the service protection domain. In doing so 
the thread loses access to its own private code and data 
and gains access to service code and data. 

Figure 1 Protected Library 


This description illustrates the similarity between 
Protected Libraries and encapsulated objects. However, 
Protected Libraries differ from encapsulated objects in 
three respects. First, Protected Libraries enforce protec- 
tion boundaries. Admittedly, some object implementa- 
tions also do this. Second, Protected Libraries allow for 
direct information sharing between a client and a ser- 
vice, through Context Specific Libraries as described 
shortly. Finally, because Protected Libraries are actually 
protection domains, their semantics encompass operat- 
ing system and user environment resource management. 
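
The entry-point semantics of this section can be illustrated
with a small model. The sketch below is not the prototype's
mechanism; it is a hypothetical Python simulation in which
each domain's data is a dictionary and the names Thread,
ProtectedLibrary, DomainError, peek_counter and
leak_client_data are invented for illustration. It models
data visibility only, not code visibility.

```python
class DomainError(Exception):
    """Raised when a thread touches data outside its current domain."""


class Thread:
    """Hypothetical model of a thread that can traverse protection domains."""

    def __init__(self, client_data):
        self.current = "client"                  # domain currently executing in
        self.visible = {"client": client_data}   # data known for each domain

    def read(self, domain, name):
        # Only data belonging to the current domain is accessible.
        if domain != self.current:
            raise DomainError(f"{domain} data is not visible from {self.current}")
        return self.visible[domain][name]


class ProtectedLibrary:
    """Passive domain that can be entered only through defined entry points."""

    def __init__(self, name, data, entry_points):
        self.name = name
        self.data = data
        self.entry_points = entry_points   # entry name -> function(thread, *args)

    def call(self, thread, entry, *args):
        if entry not in self.entry_points:
            raise DomainError(f"{entry} is not a defined entry point")
        caller = thread.current
        thread.visible[self.name] = self.data
        thread.current = self.name         # thread now sees service data only
        try:
            return self.entry_points[entry](thread, *args)
        finally:
            thread.current = caller        # returning restores the caller's view


def peek_counter(thread):
    return thread.read("service", "counter")


def leak_client_data(thread):
    return thread.read("client", "secret")   # fails: client data is hidden here


service = ProtectedLibrary(
    "service",
    {"counter": 7},
    {"peek_counter": peek_counter, "leak_client_data": leak_client_data},
)
client_thread = Thread({"secret": 42})
```

In this model a call through a defined entry point succeeds,
while a direct read of service data from the client domain, or
of client data from inside the service, raises DomainError.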


3.2.3 Non-Memory Resources 

Protected Libraries use the “sum of resources” 
model to specify accessibility to non-memory resources 
when executing in a secondary or Protected Library 
domain. A thread’s resource set while executing in a 
Protected Library is the sum of a set of resources passed 
from the original primary domain, a set of resources 
associated with the library domain, and a subset of the 
per-thread resources associated with a thread no matter 
what domain it is executing in. The “sum of resources” 
model only affects Protected Shared Library (second- 
ary) domains and not the root or primary domain which 
is represented by a regular process context. This follows 
the Hydra [Wulf et al. 81] model of resource manage- 
ment which allows certain sets of resources to be passed 
essentially as parameters into a domain along with a call 
to the domain. The set of resources that must be explic- 
itly passed when entering a secondary domain is part of 
a Protected Library's interface specification. 
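
The "sum of resources" model amounts to a set union. The
following sketch is a minimal illustration under assumed
names: resource_set and the resource labels such as "fd:3"
are hypothetical and do not come from the prototype.

```python
def resource_set(passed_from_primary, library_resources,
                 per_thread_resources, per_thread_subset):
    """Return the non-memory resources visible inside a secondary domain.

    The result is the union of (1) resources passed from the original
    primary domain, (2) resources associated with the library domain,
    and (3) the subset of per-thread resources that stays with the thread.
    """
    travelling = set(per_thread_resources) & set(per_thread_subset)
    return set(passed_from_primary) | set(library_resources) | travelling


# Hypothetical resource labels, used only to exercise the function.
in_library = resource_set(
    passed_from_primary={"fd:3"},
    library_resources={"fd:log", "sem:lib_lock"},
    per_thread_resources={"sched", "acct", "sigmask"},
    per_thread_subset={"sched", "acct"},
)
```

Here the caller passes one file handle, the library
contributes its own handle and semaphore, and only two of the
three per-thread resources are visible inside the library.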

For the most part, programmers building Protected 
Libraries can safely ignore the notion of passing 
resources from one domain to another. Default specifi- 
cations allow this detail to be disregarded, in general, 
without complications. A programmer might have to 
choose a particular set of resources to be passed into a 
domain in the case of exception vectors, such as UNIX 
signal handlers, for user-level threads. Even in the case 
of UNIX signals, however, determination of which sig- 
nals should be handled on a per-domain basis and which 
must be handled by the primary domain would generally 
be straightforward. 


3.3 Context Specific Libraries 

Context Specific Libraries (CSLs) add sharing 
primitives to the PSL abstraction. CSLs are units of 
modularity, but not protection, that may be accessible by 
multiple domains simultaneously. CSLs contain infor- 
mation that must be shared across multiple domains. 
The CSL is mapped into different protection domains 
depending on the type of sharing required. CSLs thus 
extend traditional cross-domain RPC, with which infor- 
mation is shared through parameters only, by adding 
communication via shared memory. CSLs are not pro- 
tection domains but communication channels that may 
be associated with protection domains to share informa- 
tion. Depending on the type of sharing, as described 
below, a CSL may be viewed as a fixed piece of infor- 
mation accessible by all domains, mapped information 
that moves with a thread across domains, or in other 
ways. A programmer need decide only what information 
resides in the CSL and what kind of sharing is required. 
Mapping and sharing of CSLs in multiple domains is 
handled by the PSL implementation. 


3.3.1 Context Specific Library Properties 

A CSL is uniformly shared and named in all 
domains in which it is visible. This implies that symbol 
names and addresses seen by different domains are con- 
sistent for a given CSL. It does not imply that all 
domains that use a particular CSL necessarily see the 
same contents. However, all domains using a particular 
CSL see the same resolved names at the same addresses; 
they may or may not see the same contents depending 
on the type of CSL. 

As discussed in the previous section, this approach 
of sharing information between domains has consider- 
able benefits, such as the ability to exchange pointers 
between domains and deal with shared data through 
symbolic names rather than pointers. This advantage 
results from encapsulating information in shared librar- 
ies. Use of shared libraries rather than a facility such as 
mmap or shared memory IPC implies that shared infor- 
mation is subject to relocation during loading. This 
allows uniform relocation and address space allocation 
required for uniform naming and sharing to be imple- 
mented relatively easily. CSLs may, therefore, be called 
and used from PSL based protection domains or the 
originating root domain with exactly the same effect. 
From the point of view of the calling protection domain, 
CSLs look like regular shared libraries with different 
data sharing characteristics. CSLs appear to execute in 
the context of the calling domain with one exception. 

A CSL module executes in the kernel context of the 
calling domain. The kernel resources available to CSL 
code are, therefore, those of the client domain calling 
the CSL. A CSL also maintains its own user-level con- 
text. The reason for this is based on programming expe- 
rience with CSLs which are frequently used to allocate 
and manipulate shared data. In such cases, it is 
extremely useful to depend on a memory allocation or 
malloc function that is specialized towards the kind of 
sharing supported by the particular CSL. Thus, a pro- 
grammer requiring a particular kind of data sharing 
encapsulates data and its manipulating code in a CSL 
and uses regular malloc calls. This frees the programmer 
from explicitly having to deal with multiple versions of 
malloc. 

CSLs support three distinct types of sharing each 
intended to support a different type of interaction 
between protection domains. In all three types, each 
protection domain accessing the CSL views the CSL at 
the same address. The three types of sharing differ in the 
contents seen by different domains. 


3.3.2 Global Context Specific Libraries 

Global CSLs are used to share data and addresses 
among multiple domains. A Global CSL is depicted in 
Figure 2. Global CSLs feature a single instance of their 
associated data. This instance is mapped at a single 
address into the address space of each primary or sec- 
ondary protection domain accessing the CSL. The glo- 
bal sharing model relies on clients of the shared data 
using voluntary locking protocols to maintain coher- 
ency. 

One example use of a Global CSL is for a global 
memory allocation facility that can be used by multiple 
processes and PSLs to support globally shared data. In 
this scenario, both addresses and their associated data 
must be shared. Nearly the same effect could be 
achieved by mmapping files between different protection 
domains. However, the involvement of the linker in the 
creation of CSLs allows reference to data variables via 
symbolic names instead of through pointers. Global 
CSLs thus simplify use of shared data. 

Figure 2 Global Context-Specific Library 


3.3.3 Client Context Specific Libraries 

Client CSLs facilitate sharing of process-specific 
information between client and library domains. Shared 
information, encapsulated in a library, is mapped into 
each process’s address space as shown in Figure 3. All 
processes get their own copy of the data, but the data is 
located at identical locations in all domains. Upon call- 
ing a Protected Library entry point, the CSL data of the 
current primary (process) domain gets re-mapped into 
the service domain. Thus, Protected Library code sees 
the CSL data belonging to the current process. Figure 3 
shows that when process one calls the service, its CSL 
data gets mapped into the library domain. With this 
form of sharing, the client CSL gets mapped and 
unmapped as a protected call traverses multiple 
domains. Synchronization mechanisms based on volun- 
tary locks may be used to ensure no two threads of the 
same parent process simultaneously modify shared data. 

Figure 3 Client Context-Specific Library 





Client CSLs can be used to implement client-spe- 
cific meta-data. A Protected Library which implements 
a file system, for example, might maintain process-spe- 
cific meta-data at a certain fixed address. Whenever a 
thread of a particular process invokes the file system 
library, the library code automatically refers to the cli- 
ent-specific meta-data. A similar principle is used to 
implement u-blocks in most commercial UNIX imple- 
mentations. Per-client meta-data may also be used to 
maintain per-client method/function tables. This allows 
client-process specific behavior to be automatically 
encapsulated within Client CSLs. Any change to cli- 
ent-specific behavior is thus easily limited to a particular 
process. 


3.3.4 Domain Context Specific Libraries 
With a Domain Context Specific Library, depicted 
in Figure 4, addresses are shared, but data is not. With a 
Domain CSL, a distinct copy of the library data is main- 
tained for each protection domain. The data is mapped 
at the same address in each domain. This form of 
address sharing includes both static and dynamic data. 
Dynamic allocation of memory in a Domain CSL may 
be viewed as acquisition of a shared resource, specifi- 
cally the addresses shared by all clients of the Domain 
CSL. If a particular domain allocates memory in a 
Domain CSL, the allocated addresses may not be reused 
by any other domain. Consequently, in Domain CSLs 
locking is supported during address allocation but no 
locks are necessary when data is accessed because every 
domain has its own data. 

Figure 4 Domain Context-Specific Library 

One possible use of Domain CSLs is for sharing of 
C++ objects that contain pointers to virtual function 
tables (vtbls). In this case, the vtbl must be located at the 
same address in every domain, but its contents must be 
unique per domain [Banerji et al. 94]. Thus, the address 
of a function table can be shared while ensuring the con- 
tents of the table are unique per domain. 
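
The three sharing types of Sections 3.3.2 through 3.3.4 differ
only in which instance of the data a domain sees at the
common address. The sketch below contrasts them in a
hypothetical Python model; the class CSL, its view method
and the address 0x30000000 are illustrative assumptions, not
part of the PSL implementation.

```python
class CSL:
    """Hypothetical model of a Context Specific Library.

    Every domain sees the library at the same address; the kind of
    sharing decides which instance of the data is found there.
    """

    KINDS = ("global", "client", "domain")

    def __init__(self, kind, address, initial):
        if kind not in self.KINDS:
            raise ValueError(f"unknown CSL kind: {kind}")
        self.kind = kind
        self.address = address
        self.initial = dict(initial)
        self.instances = {}

    def _key(self, process, domain):
        if self.kind == "global":
            return None        # a single instance shared by everybody
        if self.kind == "client":
            return process     # one instance per primary (process) domain
        return domain          # one instance per protection domain

    def view(self, process, domain):
        """Return (address, data) as seen by a thread of `process`
        that is currently executing in `domain`."""
        key = self._key(process, domain)
        data = self.instances.setdefault(key, dict(self.initial))
        return self.address, data


ADDRESS = 0x30000000   # assumed address, the same in every domain

global_csl = CSL("global", ADDRESS, {"n": 0})
client_csl = CSL("client", ADDRESS, {"n": 0})
domain_csl = CSL("domain", ADDRESS, {"n": 0})
```

A write made through one view of the global CSL is seen by
every other view; a write to the client CSL follows the
calling process into the library; a write to the domain CSL
stays within the domain that made it.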


3.4 Potential Uses of PSLs 

Protected Shared Libraries are a valuable tool for 
structuring systems. PSLs provide obvious benefits 
including the efficiency associated with the use of 
shared data and passive as opposed to active protection 
domains. PSLs also provide more subtle benefits as 
described below. 


3.4.1 Scope Management 
Protected Libraries and CSLs may be used to effi- 
ciently implement scope management in systems that 
use meta-object protocols for extensibility. Meta-object 
protocol based implementations offer two sets of inter- 
faces; one provides access to normal functionality, the 
other optionally allows manipulation of the service 
implementation. The two interfaces associated with 
meta-object protocols are depicted in Figure 5. A critical 
issue in meta-object protocol implementations is ensur- 
ing that changes made to a service implementation by a 
client affect only the client making the changes. This 
issue is referred to as scope management. 


[Figure 5 diagram: Client; Base Interface; Meta Interface; Service Implementation] 

Figure 5 Meta-Object Protocol 


Scope management is often implemented by main- 
taining per-client function or method tables. Creating 
these per-client method tables in client CSLs as shown 
in Figure 6 ensures that the service protected library 
always “sees” the method tables of the current calling 
root domain. Most UNIX implementations use a similar 
facility to implement u_blocks. However, this requires 
that u_block addresses be hard-coded and thus the 
functionality is limited to in-kernel co-location of 
u_blocks. Client CSLs eliminate the need for hard- 
coding addresses and open up the functionality to any 
protected library client. 
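
Scope management with per-client method tables can be
sketched as follows. This is a hypothetical Python model:
the Service class, its override and call methods, and the
"open" method are invented for illustration, and a
dictionary keyed by client stands in for the client CSL.

```python
DEFAULT_METHODS = {
    "open": lambda path: f"default open {path}",
}


class Service:
    """Hypothetical service offering a base interface and a meta interface."""

    def __init__(self):
        # Stands in for the client CSL: one method table per calling
        # root domain, all found under the same name.
        self.client_tables = {}

    def _table(self, client):
        return self.client_tables.setdefault(client, dict(DEFAULT_METHODS))

    def override(self, client, name, fn):
        """Meta interface: change the implementation for one client only."""
        self._table(client)[name] = fn

    def call(self, client, name, *args):
        """Base interface: dispatch through the current caller's table."""
        return self._table(client)[name](*args)


svc = Service()
svc.override("process1", "open", lambda path: f"traced open {path}")
```

Because dispatch always goes through the table of the current
caller, the change made by process1 is invisible to any other
client of the same service.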


3.4.2 Locally Distributed Objects 

Distributed object implementations that cross 
machine boundaries need marshalling/unmarshalling 
and method table pointer initializations because shared 
memory facilities often do not extend across machines. 
Most marshalling/unmarshalling and method table 
pointer manipulations are unnecessary in distributed 
object implementations that do not cross machine 
boundaries. However, most implementations do not use 
shared memory to implement distributed objects effi- 
ciently in the local case [Radia 95]. Client CSLs and 
domain CSLs can be used to avoid marshalling/unmar- 
shalling or method table pointer initialization in the 
local case. 

Figure 7 shows instance data created in a client 
CSL with a method table in a domain CSL. The instance 
data gets mapped into the called domain when a locally 
distributed object is invoked. Use of domain CSLs 
ensures the appropriate method tables are found in each 
domain at identical addresses. As Figure 7 indicates, the 
caller and callee method tables are co-located since they 
are created in a domain CSL, but their contents are dif- 
ferent. In the caller, the method table contains a pointer 
to a stub-method, whereas in the callee the method table 
contains a pointer to the actual method. Thus, judicious 
use of client CSLs and domain CSLs can eliminate 
extraneous overhead in locally distributed object imple- 
mentations. 


3.4.3 Per-Client Protection Schemes 

Use of passive libraries implies per-client bindings 
thus allowing for flexibility in composing protected 
library modules. Shared libraries with or without protec- 
tion require binding of client relocations to service sym- 
bols. The nature of these bindings is such that they are 
always maintained on a per-client basis. Thus, when 
shared libraries are used as protection domains, the 
binding between a client and a protected service may be 
adjusted on a per-client basis. While most processes 
transfer to some trampoline code to access a service 
entry point, a trusted client may be linked to have direct 
access. This allows for highly efficient trusted pro- 
cesses. Similarly, well-tested and debugged clients may 
be linked to directly access a service, bypassing protec- 
tion boundaries. Finally, using facilities for dynamic 
binding of symbols, protection boundaries may be 
removed and inserted at run-time, if necessary. All these 
facilities result from a single design choice, the use of 
shared libraries or more specifically the use of passive 
modules subject to relocation. 


4. PSL Implementation 

One important aspect of the PSL research to date 
has been the construction of a prototype implementa- 
tion. The main objectives of the prototype were to clar- 
ify the PSL semantics and provide an experimental 
testbed for a quantitative performance analysis. The pro- 
totype was built on AIX 3.2.5, and consists of a modi- 
fied AIX kernel, C runtime libraries and a new linker. 
This prototype is only one possible implementation of 
PSL semantics. Because other operating systems, exe- 
cutable file formats and hardware architectures may 
imply different implementations, only the main imple- 
mentation issues are described here. 

We first present an overview that draws the connec- 
tion between different semantic features and aspects of 
the implementation. The following subsection describes 
the components themselves in detail. Prior to delving 
into implementation issues, a brief description of the 
RS/6000 memory architecture is presented. 


4.1 RS/6000 Memory Architecture 

A given address space on an RS/6000 is defined by 
a set of sixteen segment registers, each of which con- 
tains a 24-bit segment ID. The RS/6000 uses 32-bit vir- 
tual addresses, four bits of which identify a segment 
register, whose 24-bit segment ID effectively extends 
the 32-bit virtual address to 52 bits. Of the remaining 28 
bits from the original 32-bit virtual address, 16 bits iden- 
tify a virtual page within the segment, and the remaining 
12 bits identify a byte within the page. 
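
The address arithmetic above can be worked through in a few
lines. The sketch assumes that the four high-order bits of
the 32-bit address select one of the sixteen segment
registers and that the 24-bit segment ID replaces them; the
function names and the example values are illustrative.

```python
def split_address(addr32):
    """Split a 32-bit address into (segment register, page, byte offset)."""
    seg_reg = (addr32 >> 28) & 0xF      # 4 bits: one of sixteen registers
    page = (addr32 >> 12) & 0xFFFF      # 16 bits: virtual page in the segment
    offset = addr32 & 0xFFF             # 12 bits: byte within the page
    return seg_reg, page, offset


def extended_address(addr32, segment_registers):
    """Form the 52-bit address: 24-bit segment ID followed by the low 28 bits."""
    seg_reg, page, offset = split_address(addr32)
    segment_id = segment_registers[seg_reg] & 0xFFFFFF
    return (segment_id << 28) | (page << 12) | offset


# Two hypothetical address spaces that load register 3 with the same ID.
space_a = [0] * 16
space_b = [1] * 16
space_a[3] = 0x00F00D
space_b[3] = 0x00F00D
```

With these values the address 0x3ABCD123 resolves to the same
52-bit address in both spaces, which is the sharing condition
described in the following paragraph of the text.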

By definition, addresses are not generally valid 
across address spaces. Regions can, however, be shared 
among multiple address spaces if each space loads a 
given segment register with the same 24-bit segment ID. 
As with most architectures, loading of a segment regis- 
ter is a privileged operation on the RS/6000. 


4.2 Implementation Overview 

PSL semantic features map rather directly to 
aspects of the PSL implementation. In each of the fol- 
lowing paragraphs the relationship between a specific 
PSL semantic feature and the relevant aspects of the 
prototype implementation is described. The relation- 
ships between semantic features and implementation 
aspects are summarized in Table 1. 


TABLE 1 SEMANTIC FEATURES AND IMPLEMENTATION ASPECTS 


PSL Semantics             Implementation Aspects 

Shared Libraries as       Protection implemented via address space 
protection domains        manipulation; system loader maps libraries 
                          into separate address space regions; 
                          passive domains simplify stack management 

Partial address space     Partial address space switch performed by 
switch on library call    trampoline code upon entering and exiting 
                          PSL routine; PSL linker replaces library 
                          calls with traps to trampoline code 

Sharing between           Shareable address space regions 
protection domains        encapsulate shared information; trampoline 
                          code uses architecture specific tricks to 
                          share pages between domains 

Uniform addressing        Modifications to system linker (and other 
and naming                components) ensure addresses used to map 
                          shared data in one domain are reserved in 
                          all domains 


4.2.1 Shared Libraries as Protection Domains 

The PSL implementation divides a process's 
address space into different regions. PSLs are mapped 
into these regions depending on the type of sharing and 
protection desired. Protection is provided by making 
different regions visible at any given instant. The 
implementation utilizes various hardware capabilities, 
such as page tables, segment registers and supervisor 
calls to control address space visibility. Division of the 
process-private address space into regions and mapping 
of libraries into these regions is the responsibility of the 
system loader. 


4.2.2 Address Space Switch on Library Call 


A partial address space switch is performed each 
time a thread enters a PSL via a protected entry point. 
The switch is performed by a small piece of privileged 
mode code called the trampoline code. The PSL linker 
ensures that threads trap to the trampoline code when 
calling a protected library entry point and upon return- 
ing from the PSL routine. 


4.2.3 Sharing between Protection Domains 
Sharing information between protection domains 
requires that certain address space regions be deemed 
shareable. CSLs are mapped into these shareable 
regions which in turn are then made visible to the appro- 
priate domains. As a thread traverses domains, shared 
address space regions are mapped and unmapped as 
needed depending on the type of sharing. This mapping 
and unmapping is performed by the trampoline code. 


4.2.4 Uniform Addressing and Naming 

In order for shared data to appear at the same 
address in different domains, addresses which map 
shared data in one domain must be reserved in all 
domains. This reservation is ensured primarily by the 
system loader. However, because address allocation in 
UNIX is not encapsulated within the loader or any other 
single component, several other pieces of the kernel had 
to be modified to ensure reservation. 


4.3 Implementation Components 

Implementing PSLs on AIX involved modifying the 
AIX kernel, C run-time libraries and programming 
tools. We now describe the primary aspects of the PSL 
implementation. 


4.3.1 Address Space Reservation 
Uniform sharing requires a portion of each 
domain’s address space be reserved. Unfortunately, in 
AIX as with most UNIX implementations, address 
ranges are allocated independently by a number of ker- 
nel subsystems. The situation is further complicated in 
AIX by hard-coded starting addresses of code and data 
segments. To ensure address space reservation, almost 
all responsibility for address allocation was extracted 
from the various kernel subsystems and relocated to the 
system loader. In certain low-level assembly routines 
where this was not possible, address allocation logic 
was modified in place. 


4.3.2 Partial Address Space Switches 

The overhead associated with PSL protection is 
largely determined by the efficiency of the partial 
address space switches performed by the trampoline 
code. Typically an address space switch involves setting 
up page tables and flushing caches and translation 
lookaside buffers (TLBs). This process can be very 
costly depending on the number of pages involved and 
the size of the cache. Partial address space switches 
were implemented quite carefully in the prototype to 
minimize overhead and maximize performance. 

Page table entries that must be switched during a 
domain transition are preallocated. These entries are 
maintained in software using a sparse representation 
technique [Acetta et al. 86], so a large number of pages 
can be represented using a small number of entries. The 
trampoline code only switches a couple of pointers to 
incorporate these new entries into the software main- 
tained page tables. As pages are referenced in the new 
domain, the hardware page tables are lazily evaluated by 
updating them from the software tables. 
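
The combination of a sparse software table, a pointer swap on
domain transition, and lazy filling of the hardware table can
be modelled as below. This is a simplified, hypothetical
sketch: the classes SoftwareTable and AddressSpace, the
range encoding, and the tagging of hardware entries with a
segment ID are assumptions made for illustration and are not
taken from the AIX prototype.

```python
class SoftwareTable:
    """Sparse map: a few (first_page, n_pages, first_frame) ranges
    describe a large number of pages."""

    def __init__(self, segment_id, ranges):
        self.segment_id = segment_id
        self.ranges = list(ranges)

    def lookup(self, page):
        for first_page, n_pages, first_frame in self.ranges:
            if first_page <= page < first_page + n_pages:
                return first_frame + (page - first_page)
        raise KeyError(page)


class AddressSpace:
    def __init__(self, table):
        self.software = table    # pointer to the current domain's entries
        self.hardware = {}       # (segment_id, page) -> frame, filled lazily
        self.faults = 0

    def switch_to(self, table):
        # The domain transition only swaps a pointer; nothing is copied.
        self.software = table

    def translate(self, page):
        key = (self.software.segment_id, page)
        if key not in self.hardware:
            # First reference in this domain: update the hardware table
            # from the software table.
            self.faults += 1
            self.hardware[key] = self.software.lookup(page)
        return self.hardware[key]


client_table = SoftwareTable(1, [(0, 1024, 5000)])
service_table = SoftwareTable(2, [(0, 1024, 9000)])
aspace = AddressSpace(client_table)
```

In the model one range entry covers 1024 pages, a switch costs
a single assignment, and a hardware entry is created only when
a page is first referenced in a domain.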

Certain sets of addresses are invalidated during a 
partial address space switch using architecture-specific 
techniques. Typically, cache and TLB contents are inval- 
idated during an address space switch to prevent cached 
data and translations from being used erroneously. To 
avoid the high cost of such invalidations, the PSL imple- 
mentation performs a partial address space switch by 
changing segment register contents instead of modifying 
page tables. This prevents user-level threads from gener- 
ating illegal addresses, while allowing for very fast 
switches. Variations of this architecture-specific tech- 
nique have been used on other architectures as well 
[Liedtke 95]. 


4.3.3 Protected Shared Library Linker 

The Protected Shared Library linker subsumes the 
functionality of /bin/ld, the normal AIX linker. For 
binaries that are not and do not use PSLs, the PSL linker 
simply calls /bin/ld. For binaries that are PSLs or 
use PSLs, the PSL linker has three main responsibilities. 
First, wherever it detects calls to a protected library 
entry point, the linker patches in a trap to trampoline 
code. Second, when creating a PSL, the linker adds 
descriptive information into an unused field of the resulting 
binary. This information describes the kind of PSL, and 
type of sharing. It is used by the system loader to map 
the library into an appropriate address space region. 
Finally, the linker ensures PSL initialization and termi- 
nation routines are called as needed. Like many other 
shared library implementations [Dietel & Kogan 92], 
PSLs support sub-system and per-client initialization 
and termination routines. 


4.3.4 Protected Shared Library Loader 

The PSL loader replaces the AIX 3.2.5 system 
loader. In short, the loader implements most static 
aspects of PSL semantics. It creates multiple address 
space regions within private address spaces, maps librar- 
ies to these address space regions and generates per- 
domain information made available to the trampoline 
code. The PSL loader differs from the original AIX 
loader in two main ways. First, data mapping, symbol 
resolution and relocation were modified to ensure PSL 
semantics. Second, functionality was added to set up 
virtual memory data structures for each address space 
region a library is mapped into. 

Typical UNIX loaders map object module data sec- 
tions into a single address region, the process data seg- 
ment. In contrast, the PSL loader may map object 
module data sections into multiple address space 
regions. The exact address space regions that a data sec- 
tion is loaded into depends on the type of PSL. This 
mapping and subsequent resolution and relocation cre- 
ates multiple address space regions, one of which is the 
traditional data segment; the rest map PSL libraries as 
shown in Figure 2, Figure 3, and Figure 4. 

The PSL loader employs a number of data struc- 
tures to ensure visibility of address space regions in 
each protection domain. These structures include both 
hardware and operating system dependencies and are 
designed to allow path lengths through the critical tram- 
poline code to be minimized. 


4.3.5 Stack Management 

Passive protection domains simplify stack manage- 
ment during domain transitions. As in thread migration 
implementations [Bershad 90], there are two types of 
stacks. A system maintained activation stack ensures 
protected library calls can be nested. As a thread enters a 
new domain, state information for the calling domain is 
pushed onto the activation stack. Upon returning to the 
caller, the state information is restored from the activa- 
tion stack and the stack is popped. The second stack is 
the execution stack used by almost all run-time environ- 
ments. Because all secondary protection domains are 
passive, the execution stack of the calling thread moves 
with it across domains. Specifically, a thread’s execution 
stack gets unmapped from the calling domain and 
mapped into the called domain during a protection 
domain switch. This eliminates the complexity of 
dynamic stack allocation associated with most thread 
migration implementations [Bershad 90]. 
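
The interplay of the two stacks can be illustrated with a
short model. The class MigratingThread and its enter and
leave methods are hypothetical names; the sketch records
only the calling domain as state information, whereas a real
implementation would save more.

```python
class MigratingThread:
    """Hypothetical model of a thread crossing passive protection domains."""

    def __init__(self, root_domain):
        self.domain = root_domain
        self.activation_stack = []   # system maintained, one record per nested call
        self.execution_stack = []    # ordinary call frames; moves with the thread

    def enter(self, target_domain):
        # Push the caller's state, then continue in the called domain
        # on the same execution stack.
        self.activation_stack.append(self.domain)
        self.domain = target_domain

    def leave(self):
        # Restore the caller's state and pop the activation record.
        self.domain = self.activation_stack.pop()
```

Nested protected calls push one record each, returns unwind
them in reverse order, and no new execution stack is
allocated at any point.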


4.3.6 Resource Management 

The current PSL prototype transfers almost all 
resources from the calling to the called domain during a 
protection domain switch. The only exceptions are sig- 
nal related resources that are handled on the basis of sig- 
nal type to allow for error recovery after exceptions. A 
complete implementation of the PSL resource handling 
semantics would require significant modifications of the 
UNIX kernel. This is a reflection on resource handling 
in UNIX kernels, not on PSL semantics. 

Over the last decade or so, kernel code in most 
UNIX implementations has become quite structured. 
The vnode [Kleiman 86], HAT layer [Goodheart & Cox 
93] and the emerging UDI interfaces [UDI 96] ensure 
the file-system, low-level virtual memory management 
services and I/O system are accessed through well- 
defined interfaces. This provides some degree of encap- 
sulation and isolates clients of these interfaces from 
implementation changes. Furthermore, indirections such 
as those postulated by the stackable file systems stan- 
dard [Heidemann 95] can be easily implemented. This 
tends to make subsystem implementations more flexible 
and easier to maintain. 

Unfortunately the same cannot be said for process 
and resource management. Typically this is done 
through the u_block and proc structures and UNIX ker- 
nels are typically littered with direct access to these 
structures. Such direct access prevents any degree of 
encapsulation and makes it very difficult to build indi- 
rections or change UNIX resource handling. The need 
for encapsulation of resource handling in UNIX kernels 
has been previously recognized [Zajcew et al. 93]. 
Implementation of PSL based protection domains and 
process migration clearly requires better encapsulation of 
resource handling, and standardization in this area is 
strongly encouraged. 


4.3.7 Trampoline Code 

The trampoline code performs the partial address 
space switches required when a client calls and subse- 
quently returns from a protected shared library routine. 
The trampoline code performs six functions. 
• Stack management 
• Changing of address space visibility 
• Handling of shared address space regions 
• Modification of non-memory resource accessibility 
• Passing of caller’s resources to called domain 
• Transfer of control to target entry point 

The trampoline code is responsible for almost all 
dynamic aspects of PSL semantics. It is by far the most 
performance critical part of the PSL implementation. 
Because of this, the code is written in assembly lan- 
guage and pinned in memory at run time. Furthermore, 
in the prototype AIX implementation, the trampoline 
code is accessed via a special trap handler that avoids 
much of the overhead of a typical UNIX system call. 


5. Performance 

This section sheds light on various aspects of PSL 
performance. In general, Protected Shared Libraries 
have been found to perform better than other popular 
forms of cross-domain cooperation. The next subsection 
begins with a comparison of null-RPC times. This is fol- 
lowed by a comparison of PSLs with other competitive 
protection schemes. Finally, the section ends with a 
breakdown of PSL call costs. A more thorough analysis 
of PSL performance can be found in [Banerji 96]. 


5.1 Null-RPC Benchmark 

Table 2 shows the null RPC times for several RPC 
implementations in AIX 3.2.5 on an RS/6000 Model 
530 with a relatively slow 25 MHz POWER processor. 
The first two numbers are for classic user-level IPC; the 
third is for hand-off scheduling [Black 89]. The fourth 
number indicates thread migration is a bit slower than a 
PSL call due to resource handling overhead. The final 
two lines indicate the PSL null-call time is comparable 
to the time required for a null system call. 


TABLE 2 NULL RPC TIMES 


Implementation          Null RPC Time 
System call trap 


5.2 Benchmarks 


Figure 8 Client/Service Relationship (client code calls across a protection boundary into service code) 


This section evaluates five different protection 
schemes for six different benchmark tests. In each case, 
the benchmark code is built as a service which is 
invoked by a client. The goal is to evaluate the cost of 
protecting the service from the client. Figure 8 shows 
how each invocation has to cross from client code to ser- 
vice code, and indicates where we start and stop our data 
gathering. The benchmarks used were: 

• MD5, a secure one-way hash function developed to 
reliably identify long byte strings [Rivest 92]. The 
implementation used is based on code made available 
by RSA [RSA 93]. The input byte string is partitioned 
into fixed-length substrings, and the 
algorithm operates on the substrings in succession. 

• Nsieve, a well-known benchmark that computes 
prime numbers. Problem size for Nsieve is the total 
number of primes to calculate; granularity is the 
number of numbers searched. The iterative portion 
of the Nsieve code was built as the service, so the 
number of invocations depends on the density of the 
primes. 

• tdbm_i, tdbm_f, and tdbm_d, three benchmarks 
involving the tdbm database, a small in-memory 
database based on the Berkeley UNIX ndbm 
library. This is a slight modification of the sdbm 
library released by Ozan Yigit [Yigit 92], and is 
based on the 1978 dynamic hashing algorithm by 
Paul Larson [Enbody 88]. Changes were made to 
avoid unnecessary copying and remove file dependencies. 
The tests involve insertion of N words 
from an extended version of /usr/dict/words, random 
fetch of N/2 words, and deletion of N/2 words. 

• Nullc, a custom benchmark that is essentially a null 
call. The service is passed a block of data. It 
touches every data page and returns the data to the 
client. This test measures the base cost of transferring 
variable-sized parameters between protection 
domains. 


5.2.1 Protection Schemes 
The protection schemes include both classic and 
new approaches. Three are hardware based and depend 
upon the kernel protection boundary. The other two are 
software-based approaches. The protection schemes are 
null-protection, which is used as a baseline; traditional 
kernel-based system calls; process-based protection 
with thread migration; library-based PSLs; a language-based 
safe subset of Modula-3; and software fault isolation 
[Wahbe 93]. 


5.2.2 Methodology 

Measurements were taken on an IBM RS/6000 
Model 390 with a single 66-MHz POWER2 processor 
[Weiss 94]. For each benchmark, the granularity of the 
protection domain was varied, and the number of 
machine cycles needed to perform the service was 
recorded. Only plots for md5 and tdbm_d are shown 
here. For the md5 tests, the total message size was kept 
constant at 512 KBytes, and the number of bytes passed 
to the service with each invocation was varied. Figure 9 
shows the resulting performance as a function of problem 
granularity. Increasing granularity results in fewer 
service invocations, which causes the execution time of 
all schemes to decrease. For tdbm, the number 
of elements in the database was varied. Each invocation 
deals with one entry, but for larger databases the service 
does more work. The results for tdbm, shown in Figure 
10, differ significantly from the md5 curves in Figure 9. 
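
The relationship between granularity and the number of service invocations in the md5 tests can be illustrated with a small sketch. This is not the benchmark code used for the measurements: it assumes Python's standard hashlib module as the hash implementation, and the names md5_service and run_md5 are hypothetical. It only shows that a fixed 512 KByte message hashed in larger pieces needs fewer calls across the client/service boundary while producing the same digest.

```python
import hashlib

TOTAL_BYTES = 512 * 1024  # total message size, as in the md5 tests


def md5_service(state, chunk):
    """Stand-in for one service invocation: hash one substring."""
    state.update(chunk)


def run_md5(message, granularity):
    """Feed the message to the service `granularity` bytes at a time.

    Returns the final digest and the number of service invocations.
    """
    state = hashlib.md5()
    invocations = 0
    for offset in range(0, len(message), granularity):
        md5_service(state, message[offset:offset + granularity])
        invocations += 1
    return state.hexdigest(), invocations


# Arbitrary test message of the right size (an assumption of this sketch).
message = bytes(range(256)) * (TOTAL_BYTES // 256)
for granularity in (64, 1024, 16384):
    digest, calls = run_md5(message, granularity)
    print(granularity, calls, digest)
```

In a protected setting, each call to md5_service would cross a protection boundary, so the per-call overhead is paid once per chunk.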


5.2.3 Analysis 

A few points about the results shown in Figure 9 
and Figure 10 bear mention. First, PSLs outperformed 
thread migration. This is primarily due to cross-domain 
sharing, simplified stack management and improved 
resource management. Second, Figure 9 indicates that at 
higher problem granularities PSLs outperformed 
kernel-based protection. With md5 this happens when the extra 
kernel trap and return of PSL interactions is outweighed 
by the data copying costs of kernel interactions (see 
footnote 3). Finally, in certain cases the PSL implementation 
may actually outperform the unprotected case, as shown in 
Figure 9. This is due to the PSL implementation of sharing, which 
allows data to remain in the cache between runs of different 
processes, whereas ordinary shared library data is 
always faulted into the caches on a per-process basis (at 
least in UNIX). 


5.3 PSL Cost Breakdown 

Figure 11 shows a breakdown of the costs associated 
with PSL-based protection compared to the case of 
no protection. Most of the overhead is due to aliasing, 
the resolution of multiple virtual addresses to the same 
physical address. The POWER2 architecture provides 
hardware support to resolve aliases, but the support is 
not exploited by AIX 3.2.5. 


3. With kernel trap and return times steadily decreasing (11 
cycles in UltraSparc), and cache access latencies decreasing slowly, 
the break-even byte granularity is expected to decrease. 


5.3.1 Aliasing Cost 

Aliasing arises with hardware-based protection 
because client and service domains are different virtual 
address spaces. Consequently, virtual memory data 
structures must be updated when control is transferred 
between client and server. Also, references to shared 
data can cause TLB misses. The resulting alias faults 
dramatically impact transition overhead. These two 
costs, the in-kernel transition cost of updating aliasing 
data structures and extra faults due to aliasing, significantly 
impact PSL performance. 

Figure 11 indicates the in-kernel transition cost, 
which includes adjustments to aliasing data structures, is 
essentially the same for all benchmarks. For nullc and 
md5, only one domain actually touches the shared data, 
so there are no alias faults. For the other tests, however, 
both client and library code access the data which 
results in considerable time spent handling aliasing 
faults. Excluding time for alias faults, for five of the six 
tests, the PSL overhead lies between 1300 and 2500 
cycles. 

To assess the cost of updating aliasing data structures, 
the number of instructions in the transition code 
was counted for the simplest service, nullc, with a four 
byte transfer. This code calls AIX routines, written in C, 
which adjust the virtual memory data structures. Thus, 
the difference between the transition code instruction 
count (210) and the total instructions used to make the 
transition (1250) provides an indication of the cost of 
aliasing (approximately 1040 instructions). Hence, if we ignore 
the costs due to aliasing, which can be safely done for 
most modern hardware, the cost of a PSL call is actually 
the cost of executing 210 instructions and the kernel 
trap/return times. This cost is 275 cycles for the 66 MHz 
IBM POWER2. 


5.3.2 Trap costs 

Given that the aliasing problem can be solved in 
hardware, it is important to look at the other costs. 
These are approximately 210 instructions per transition, 
or about 275 cycles including kernel trap and return. 
This cost approaches that of highly optimized cross-domain 
transfer mechanisms [Hamilton & Kougiouris 
93]. 

Without aliasing, the hardware cost of the kernel 
trap and return is significant, 57 cycles for the 
POWER2. Library-based protection traps twice for 
every service invocation, thus 114 of the remaining 275 
overhead cycles are due to the trap and return. Some 
modern processors, such as the UltraSparc, have 
reduced trap overhead to as little as 11 cycles (that is, 22 
cycles for PSL-like double trap and returns) and trap 
costs are expected to continue decreasing. 
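
The arithmetic behind the cost breakdown in Sections 5.3.1 and 5.3.2 can be checked with a few lines. The constants below are the figures quoted in the text; the variable names, and the conversion of 275 cycles to microseconds at 66 MHz, are additions of this sketch rather than values reported in the paper.

```python
# Figures quoted in Sections 5.3.1 and 5.3.2.
TOTAL_TRANSITION_INSTRUCTIONS = 1250
TRANSITION_CODE_INSTRUCTIONS = 210
PSL_CALL_CYCLES = 275          # per call, ignoring aliasing, on the POWER2
TRAPS_PER_CALL = 2             # library-based protection traps twice
POWER2_TRAP_CYCLES = 57
ULTRASPARC_TRAP_CYCLES = 11
CLOCK_HZ = 66_000_000          # 66 MHz POWER2

# Instructions attributed to aliasing: total minus transition code.
aliasing_instructions = (TOTAL_TRANSITION_INSTRUCTIONS
                         - TRANSITION_CODE_INSTRUCTIONS)

# Cycles spent in trap and return, and what remains for the transition code.
power2_trap_cost = TRAPS_PER_CALL * POWER2_TRAP_CYCLES
non_trap_cycles = PSL_CALL_CYCLES - power2_trap_cost
ultrasparc_trap_cost = TRAPS_PER_CALL * ULTRASPARC_TRAP_CYCLES

# Derived here, not stated in the text: wall-clock time of 275 cycles.
call_time_us = PSL_CALL_CYCLES / CLOCK_HZ * 1e6

print("aliasing instructions:", aliasing_instructions)
print("POWER2 trap cycles per call:", power2_trap_cost)
print("remaining cycles per call:", non_trap_cycles)
print("UltraSparc trap cycles per call:", ultrasparc_trap_cost)
print("PSL call time (microseconds): %.2f" % call_time_us)
```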


6. Discussion 

Shared libraries form an excellent basis of modular- 
ity for structuring large systems. Protected Shared 
Libraries enhance the popular notion of shared libraries 
in two ways, by adding protection and allowing data to 
be shared across protection boundaries. This enables 
PSLs to be used to securely implement sensitive ser- 
vices. Sharing reduces many of the costs of cross- 
domain interactions, thus making it a viable alternative 
to language-based or process-based protection schemes. 
A prototype PSL implementation has demonstrated the 
efficiency of the PSL approach. 

PSLs are well-suited for use in large user-level 
applications and for implementation of operating system 
services at user level as is common in microkernels. 
There are also less-obvious uses for PSLs. First, dynam- 
ically loadable kernel extensions suffer from their abil- 
ity to corrupt kernel data. Extensions could be prevented 
from corrupting data to which they do not need write 
access by building them as dynamically loadable privi- 
leged mode PSLs. Thus, PSLs may provide safe ways of 
extending existing operating system kernels. Second, 
one important research area in operating systems is the 
design of low-level nanokernels or exokernels [Engler 
95]. These kernels provide low-level protected interfaces 
to the hardware with most operating system functionality 
implemented in library routines. PSLs are an 
attractive approach to protecting such operating systems 
from application code. 


Abstract 


In this paper we show how to extend the functional- 
ity of standard operating systems completely at the user 
level. Our approach works by intercepting selected sys- 
tem calls at the user level, using tracing facilities such 
as the /proc file system provided by many Unix oper- 
ating systems. The behavior of some intercepted sys- 
tem calls is then modified to implement new functional- 
ity. This approach does not require any re-linking or re- 
compilation of existing applications. In fact, the exten- 
sions can even be dynamically “installed” into already 
running processes. The extensions work completely at 
the user level and install without system administrator 
assistance. 


We used this approach to implement a global file sys- 
tem, called Ufo, which allows users to treat remote files 
exactly as if they were local. Currently, Ufo supports 
file access through the FTP and HTTP protocols and 
allows new protocols to be plugged in. While several 
other projects have implemented global file system ab- 
stractions, they all require either changes to the operat- 
ing system or modifications to standard libraries. The 
paper gives a detailed performance analysis of our ap- 
proach to extending the OS and establishes that Ufo in- 
troduces acceptable overhead for common applications 
even though intercepting system calls incurs a high cost. 


Keywords: operating systems, user-level extensions, 
/proc file system, global file system, global name space, 
file caching 


1 Introduction 


Computer users have always had the desire to extend 
operating systems functionality to support new proto- 
cols or meet new usage patterns. In this paper we show 
how to extend a standard Unix operating system (So- 
laris) completely at the user level. Our approach — 
which is similar to interposition agents [Jon93] — uses 
tracing facilities to intercept selected system calls at the 
user level. The behavior of intercepted system calls is 
then modified to implement new functionality. We use 
this to implement Ufo,¹ a global file system, which supports 
file access through the FTP and HTTP protocols 
and can easily be expanded to support new protocols. 

While this paper focuses on extending the file sys- 
tem services, our approach provides a general way of 
expanding operating system functionality without any 
kernel changes or library modifications. Our extensions, 
which are not just limited to Solaris, do not require any 
re-linking or re-compilation of existing applications. In 
fact, they can even be dynamically “installed” into already 
running processes. Extensions can be added on a 
per-user basis, i.e., extensions for one user do not affect 
other users. Actually, even a single user could run different 
jobs with different extensions without interference. 

An important advantage of this method is that devel- 
oping the OS extensions can be done entirely at the user 
level and without access to OS source code. This makes 
our approach an excellent way for testing new kernel ex- 
tensions and for providing OS extensions that are not 
performance critical. 


1.1 Personalized Global File Systems 


With the recent explosive growth of the Internet an 
increasing number of users, including us, have access to 
multiple computers that are geographically distributed. 
The initial motivation for our work was the desire to have 
transparent file access from our Unix machines to our 
personal accounts at remote sites. In addition, we also 
wanted to present the resources from the large number 
of existing HTTP and anonymous FTP servers as if they 
were local files. This would allow all local applications 
to transparently access remote files. 


¹ The acronym Ufo stands for User-level File Organizer. 

Ufo implements a global file system that provides this 
functionality. It is a user-level process that runs on multi- 
user Unix systems and connects to remote machines via 
authenticated and anonymous FTP and HTTP protocols. 
It provides read and write caching with a weak cache 
consistency policy. 

It was important to us that the file system not only 
run at the user-level but that it also be user-installable 
(i.e. installing it does not require root access). For exam- 
ple, assume one of us obtains a new account at an NSF 
supercomputer center. Once we log into that account, 
we would like to transparently see all remote files we 
have some way of accessing (be it via telnet, FTP, rlogin, 
NFS, or HTTP); all without having to ask the system ad- 
ministrator to install anything. This is not necessarily 
easy to do in a Unix environment since most current file 
system software must be installed by a system adminis- 
trator. For example, systems such as NFS and AFS allow 
sharing of files across the Internet, but they require root 
access to mount or export new file partitions. The system 
administrator may not have the time or, due to security 
concerns, may not be willing to install a new piece of 
software or export a file system resource. 

A user-installable file system does not have these 
problems. Not only can users install it themselves, but 
it does not introduce any additional security holes in the 
underlying operating system or network protocol. To 
guarantee that a file system can indeed be installed by 
the user, it should only rely on functionality provided by 
standard (unmodified) operating systems. 


1.2 Personalizing the Operating System: The 
Ufo Approach 


In order to provide a global file system we need ex- 
tensions to the operating system that handle file accesses 
(and related functions) properly. By modifying the be- 
havior of the system calls, we can add new functionality 
to the operating system. In our approach we modify the 
system call behavior by inserting a user-level layer, the 
Catcher, between the application and the operating sys- 
tem. 

The Catcher is a user-level process which attaches to 
an application and intercepts selected system calls is- 
sued by the application. From the user's perspective, the 
Catcher provides a user-level layer between the user's 
application processes and the original operating system, 
as shown in Figure 1. This extra layer does not change 
the existing OS, but allows us to control the user's en- 
vironment, either by modifying function parameters, or 
issuing additional service requests. 


Figure 1: A new view of the operating system. (User A and User B processes each run on top of a user-specific OS layer, which sits on the Standard Operating System.) 


The Catcher operates as follows: Initially, it connects 
to the user process and tells the operating system 
which system calls to intercept. Our implementation, 
which runs under Solaris 2.5.1, uses the System V 
/proc interface,² which was originally developed for 
debugging purposes [FG91]. Instead of just tracing the 
system calls, we actually change at the user-level the 
semantics of some of them to implement the global 
file system. Whenever a system call of interest begins 
(or completes), the operating system stops the subject 
process and notifies the Catcher. The Catcher calls 
the appropriate extension function, if needed, and then 
resumes the system call. 

For our global file system, we intercept the open, 
close, stat, and other system calls that operate on files. 
When we intercept a system call which accesses a 
remote file, we first ensure that an up-to-date copy is 
available locally. Then, we patch the system call to 
refer to the local copy and allow it to proceed. System 
calls which only access local files are not modified (they 
can just continue), while most system calls not related 
to files are not even intercepted. Since no application 
binaries are changed, this approach works transparently 
with any existing executable (with the exception of the 
few programs requiring setuid). 
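
The patching step described above can be sketched as a mapping from global file names to paths in a local cache. This is an illustration only, not Ufo's code: the two name forms are taken from the usage examples in Section 1.3, while CACHE_ROOT, RemoteName, parse_remote, and patch_path are hypothetical names, and the step that actually fetches or refreshes the local copy is left out.

```python
import posixpath
from typing import NamedTuple, Optional

CACHE_ROOT = "/tmp/ufo-cache"  # hypothetical cache location


class RemoteName(NamedTuple):
    protocol: str
    host: str
    user: Optional[str]
    path: str


def parse_remote(name: str) -> Optional[RemoteName]:
    """Return a RemoteName for global names, or None for local files."""
    if name.startswith("http://"):
        host, _, path = name[len("http://"):].partition("/")
        return RemoteName("http", host, None, "/" + path)
    if name.startswith("/ftp/"):
        site, _, path = name[len("/ftp/"):].partition("/")
        user, at, host = site.rpartition("@")
        return RemoteName("ftp", host, user if at else None, "/" + path)
    return None


def patch_path(name: str) -> str:
    """Return the path that the patched system call should refer to."""
    remote = parse_remote(name)
    if remote is None:
        return name  # system calls on local files are not modified
    # A real implementation would first ensure an up-to-date local copy.
    return posixpath.join(CACHE_ROOT, remote.protocol, remote.host,
                          remote.path.lstrip("/"))


print(patch_path("http://www.cs.ucsb.edu/index.html"))
print(patch_path("/etc/passwd"))
```

For brevity the cache path in this sketch ignores the user name, so two accounts on the same host would share cache entries; a fuller version would include it.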

A potential concern with our approach is its perfor- 
mance overhead. While the cost for intercepting system 
calls is significant, our performance analysis shows that 
Ufo introduces acceptable overhead for common appli- 
cations. 

Since Ufo runs fully at the user-level, if one user 
runs it there is no performance penalty on another user. 
Furthermore, a user can run Ufo only on some selected 
applications without impacting other applications, 
or even dynamically “install” (attach) or “uninstall” 
(detach) Ufo while applications are running. 


² Similar functionality is provided by Digital Unix, IRIX, BSD and 
Linux. This mechanism is used by system call tracing applications 
such as truss or strace. 


1.3 Using Ufo 


Installing Ufo can be done by any user without need- 
ing root assistance. The simplest way to start using Ufo 
is to explicitly start processes under its control, e.g. 


tcsh% ufo csh 
csh% grep UCSB http://www.cs.ucsb.edu/index.html 
csh% cd /ftp/schauser@cheetah.cs.ucsb.edu/ 
csh% emacs papers/ufo/introduction.tex & 


In the example above, the new shell running under 
Ufo can use the global file system's services. Ufo au- 
tomatically attaches to any child that the shell spawns, 
like the grep and emacs processes above. Alternatively, 
Ufo can be instructed to dynamically attach to an already 
running process by providing its pid. 


tcsh% emacs & 
[3] 728 
tcsh% ufo -pid 728 


1.4 Outline 


The remainder of the paper is structured as follows. 
Section 2 reviews related work and compares our 
method for operating system extension with alternative 
approaches. Section 3 describes in detail the Catcher 
and how it intercepts system calls at the user-level. 
Section 4 discusses the design decisions of Ufo, the 
user-level global file system. Section 5 presents ex- 
perimental results for a variety of micro-benchmarks, 
standard Unix file system benchmarks, and full applica- 
tion programs. Section 6 concludes this paper and offers 
an outlook on future research directions. 


2 Related Work 


Before presenting implementation details of the 
Catcher (in Section 3), we will put our work in 
context by comparing our approach with alternative 
ways of extending operating system functionality. The 
eager reader can skip directly to the discussion of 
our implementation in Section 3. We first introduce a 
classification of different approaches to extending the 
operating system. We then discuss the relevant research 
projects on extending operating system and file system 
functionality in more detail. 


2.1 Approaches for Extending the Operating 
System 


There has been a considerable amount of work on ex- 
tending operating systems with new functionality. We 


can classify these approaches into the following cate- 
gories: 


Change Operating System: The most straightforward 
approach is to just modify the operating system it- 
self and incorporate the desired functionality. This 
requires access to the OS sources and the privileges 
to install the new kernel. 


Device Driver: Instead of changing the kernel itself, 
modifications can be limited to a new device driver 
which implements the desired functionality. Root 
access is required to install the device drivers. 


Network Server: A clean solution with minimal 
intrusion to the operating system is to install a 
network server, which provides the additional 
services through an already existing standardized 
interface. Installing the server and mounting 
remote directories requires root capabilities. 


We want to reiterate that these first three approaches require 
super-user intervention and affect everybody using 
the system, since everybody will see the modifications 
to the operating system. If there is a bug or security hole 
in the newly installed software, the whole system's in- 
tegrity and security can be compromised. A user-level 
approach avoids this problem. 


User-level Plug-Ins: When a one-time modification to 
the operating system can be tolerated, a flexible 
strategy is to add hooks to the operating system 
so that system calls can trigger additional functions 
that extend the functionality. This approach is es- 
pecially appropriate if the OS has already been de- 
signed to be flexible and support extensions. 


User-level Libraries (Static or Dynamic Linking): 
Most applications do not directly access the operat- 
ing system, but use library functions embedded in 
standard libraries. Instead of modifying all binaries 
or the OS kernel, it suffices to make changes to the 
libraries. Super-user privileges are only necessary 
if the original libraries/binaries need to be replaced. 


Application Specific Modifications: Instead of incor- 
porating the modifications into the library, we can 
also incorporate them directly into the application, 
avoiding the operating system altogether. 


Intercept System Calls: Most modern operating sys- 
tems provide the functionality of intercepting 
system calls at the user level. A process can be 
notified when another process enters or exits se- 
lected system calls. While the original motivation 
for this functionality was debugging and tracing 
of system calls, this mechanism can also be used 
to alter their behavior. This mechanism, which 
first was used in the context of Mach to implement 
interposition agents [Jon93], forms the basis for 
our Ufo implementation. 


Method                           Examples and References 
Change Operating System          Sprite [NWO88], Plan 9 [PPTT90] 
Device Driver                    AFS, NFS, SLIC [GPA96] & WebFS [VDA96] 
Network Server                   ftp2nfs [Gsc94], Alex [Cat92] 
User-level Plug-Ins              extended OS: SLIC [GPA96], UserFS [Fit96]; 
                                 flexible/extendible OS: SPIN [BSP+94], Exokernel [EKO95] 
User-level Libraries (static)    Newcastle Connection [BMR82], Prospero [NAU93], Condor [Con95] 
User-level Libraries (dynamic)   Jade [RP93], IFS [EP93] 
Intercept System Calls           Interposition Agents [Jon93], Confinement [GWTB96], Ufo 

Table 1: Different methods of extending operating system functionality and examples. 


Method                           Root access   Re-link applications   Range of applications   Performance overhead 
Change Operating System                                               all                     very low 
Device Driver                                                         all                     very low 
Network Server                                                        all                     medium 
User-level Plug-Ins                                                   all                     low-high 
User-level Libraries (static)                                         user                    low 
User-level Libraries (dynamic)                                        dyn. linked             low 
Application Specific                                                  single                  low 
Intercept System Calls                                                all (no setuid)         high 

Table 2: Different methods of extending operating system functionality and their limitations. 


Table 1 lists examples of the above approaches, while 
Table 2 summarizes their limitations and identifies the 
context in which they can be applied. We wanted an 
approach which works with most existing applications 
without the need for recompiling, and more importantly, 
which can be used without requiring root access. There- 
fore we decided to use the mechanism of intercepting 
system calls. 


2.2 Related OS Extensions 


The project that is the closest to our own is the work 
on interposition agents [Jon93] which also makes use of 
the mechanism of intercepting system calls. Interposi- 
tion agents provide a general system call tracing toolbox, 
which allows different system calls to be intercepted and 
handled in alternate ways, as we do in Ufo. Three ex- 
ample agent applications were implemented: spoofing 
the time of day, tracing system calls (as in truss), and 
transparently merging the contents of separate directo- 
ries. The interposition agents work is based on Mach. 
While Mach is a Unix variant, it was designed to be more 
flexible and extensible. In particular, when calls are 
intercepted in Mach 2.5, they can be redirected to the 
process’ own address space. Thus, the interposition agents 
are run in the user process’ own memory. Approaches 
that use a more standard Unix, such as ours, are more 
constrained (and more complicated to implement) since 
it is more difficult to access the user process state from 
outside of the process’ address space. 


Another research project that uses the Unix trace 
mechanism for implementing an OS extension is Janus 
[GWTB96] which provides a secure, confined envi- 
ronment for running untrusted applications safely by 
intercepting and selectively denying system calls. Like 
ours, the Janus implementation has been designed for 
Solaris. 


A lot of current research deals with designing operating 
systems such that they allow for easier and more efficient 
user-level extension. Engler et al. [EKO95] carry 
the Mach micro-kernel methodology [ABB+86] further 
by removing as much kernel abstraction as possible from 
the OS. This pushes the kernel/user-level boundary as 
low as possible, placing most of the OS services outside 
of the kernel. Another approach, taken by VINO 
[SESS94] and SPIN [BSP+94], is to allow injection of 
user-written kernel extensions into the kernel domain. A 
discussion of the issues involved can be found in [SS96]. 
Another recent project, SLIC [GPA96], is an OS extension 
to Solaris that allows for plug-ins at both the user 
and the kernel level. 

We now discuss operating system extensions specific 
to our particular application: remote file transfer. 


2.3 OS Extensions for Remote File Systems 


There are a number of systems that provide transparent 
access to remote resources on the Internet, many 
of which have been very successful. Examples include 
NFS [SGK+85], AFS [MSC+86], Coda [SKK+90], 
ftpFS in Plan 9 [PPTT90] and Linux [Fit96], Sprite 
[Wel91, NWO88], WebFS [VDA96], Alex [Cat92], 
Prospero [NAU93], and Jade [RP93]. They all have 
one significant drawback, however: they either require 
root access or modifications to the existing operating 
system, applications or libraries. Ufo is distinct in that it 
requires no such modifications to any existing code and 
runs entirely at the user-level. 

There are a few systems for global file access that run 
entirely at the user-level and are user-installable. They 
are also similar to Ufo in that they extend a local file sys- 
tem to provide uniform and transparent access to hetero- 
geneous remote file servers. Prospero [NAU93] and Jade 
[RP93] both provide access to NFS and AFS file sys- 
tems, and to FTP servers. Prospero runs at user-level by 
replacing standard statically linked libraries. This avoids 
changes to the operating system, but requires re-linking 
of existing binaries. Jade [RP93] uses dynamic libraries 
instead and allows most dynamically linked binaries to 
run unmodified. Changing application libraries works 
well for most applications, especially when combined 
with dynamic linking. The drawback of this approach 
is that it does not work for statically linked applications 
not owned by the user, or for applications that 
circumvent the standard libraries and execute system call 
instructions directly. 

Other global file systems also run at the user level, but 
are not user-installable, since they require extensions to 
the operating system itself, which in turn requires root 
access. One such example is WebFS [VDA96], a global 
user-level file system based on the HTTP protocol. To 
run at the user level, WebFS relies on the OS extensions 
provided by SLIC [GPA96], which implements a call- 
back mechanism to a user process. (WebFS also requires 
the HTTP server be extended with a set of CGI scripts 
that service requests.) Similar to SLIC, UserFS [Fit96] 
is an OS extension that enables user-level file systems to 
be written for Linux. While installing UserFS itself re- 
quires kernel recompilation, installing new file modules, 
such as ftpFS, does not. Plan 9 [PPTT90] also includes 
an FTP based file system (also called ftpFS). At least two 
projects provide access to FTP servers by implementing 
an NFS server that functions as an FTP-to-NFS gateway. 
Alex [Cat92] supports read-only access to anonymous 
FTP servers, while [Gsc94] additionally allows read and 
write access to authenticated FTP servers. 


3 Catcher Implementation 


In this section we discuss the details of our imple- 
mentation of the Catcher inside Ufo. We start by de- 
scribing the high-level architecture and the role of the 
Catcher in Ufo. 


3.1 The Ufo Architecture 


Ufo is a user-level process which provides file sys- 
tem services to other user-level processes by attaching 
to them. Once attached to a subject process, it intercepts 
system calls and services them if they operate on remote 
files. The application is unaware of the existence of 
Ufo but, with Ufo's help, it can operate on remote files 
as if they were local. 


Figure 2: General architecture of Ufo. 


Ufo is implemented in two modules: the Catcher and 
the Ufo module (Figure 2). The Catcher is responsible 
for intercepting system calls and forwarding them to the 
Ufo module. The Ufo module implements the remote 
file system and consists of three layers: the File Services 
layer which identifies remote files, the Caching layer, 
and the Protocol layer containing different plug-in mod- 
ules implementing the actual file transfer protocols. 

Figure 2 shows the steps involved in servicing a re- 
mote file request. When the application issues a sys- 
tem call (1), it can go directly to the kernel or, if it is 
file-related, get intercepted by the Catcher (2). For in- 
tercepted calls, Ufo determines whether the system call 
operates on a remote or a local file, possibly using ker- 
nel services (3,4). If the file is local, the request pro- 
ceeds unmodified. If the file is remote, Ufo creates a 
local cached copy, patches the system call by modifying
its parameters, and lets the request proceed to the kernel 
(5). After the request is serviced in the kernel (6), the 
result is returned to the application (7). The return from 
the system call may also be intercepted and patched by 
Ufo, though the figure does not show this. 


3.2 Catcher Implementation Details 


In our Solaris implementation, the Catcher moni- 
tors user processes using the /proc virtual file system 
[FG91]. This is the same method used by monitoring
programs, such as truss or strace, which are also avail- 
able on a number of other UNIX platforms, including 
Digital Unix, IRIX, BSD or Linux. The System V /proc 
interface allows us to monitor and modify an individual 
process by operating on the file associated with a user 
process. 

In particular the Catcher attaches to a subject pro- 
cess pid by opening the /proc/pid file. Once attached, 
the Catcher uses ioctl system calls on the open file de- 
scriptor to control the process. It can instruct the oper- 
ating system to stop the subject process on a variety of 
events of interest. In Ufo there are two events of inter- 
est: system call entry into the kernel, and system call 
exit from the kernel. Once a subject process has stopped 
on an event of interest, the Catcher can read and write 
the registers and read and write in the address space of 
the process. The Catcher uses this to examine and mod- 
ify the parameters or the result of system calls like open, 
stat, and getdents. Finally, the /proc interface allows us
to restart the execution of a stopped process. Figure 3 
summarizes how the discussed functionality is used in 
the Catcher. 


connect to subject process
register the set of system calls to intercept
while subject process is running do
    wait for process to stop on system call entry or exit
    determine system call number and its parameters/result
    call handler function for this system call
    if system call is fork then
        begin monitoring new child process
    resume system call
endwhile


Figure 3: Outline of the Catcher algorithm. 
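
The outline in Figure 3 can be sketched as a small simulation. This is a toy model under stated assumptions, not the authors' implementation: the /proc stop events are replaced by a pre-recorded list, and the names Stop, run_catcher, and the handler table are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class Stop:
    """One stop of a traced process (hypothetical stand-in for a /proc event)."""
    pid: int
    syscall: str
    phase: str                    # "entry" or "exit"
    result: Optional[int] = None  # return value, only meaningful on exit


Handler = Callable[[Stop], str]


def run_catcher(stops: List[Stop],
                handlers: Dict[Tuple[str, str], Handler],
                root_pid: int) -> Tuple[Set[int], List[str]]:
    """Dispatch each stop to its handler and follow fork into children."""
    monitored = {root_pid}              # "connect to subject process"
    log: List[str] = []
    for stop in stops:                  # "wait for process to stop"
        if stop.pid not in monitored:
            continue                    # not one of our subject processes
        handler = handlers.get((stop.syscall, stop.phase))
        if handler is not None:         # "call handler function"
            log.append(handler(stop))
        if stop.syscall == "fork" and stop.phase == "exit" and stop.result:
            monitored.add(stop.result)  # "begin monitoring new child process"
        # A real Catcher would resume the stopped process at this point.
    return monitored, log


handlers = {
    ("open", "entry"): lambda s: f"{s.pid}: patch open arguments",
    ("open", "exit"): lambda s: f"{s.pid}: remember descriptor {s.result}",
}
stops = [
    Stop(100, "open", "entry"),
    Stop(100, "open", "exit", result=3),
    Stop(100, "fork", "exit", result=101),   # the parent sees the child's pid
    Stop(101, "open", "entry"),
    Stop(999, "open", "entry"),              # unrelated process, ignored
]
monitored, log = run_catcher(stops, handlers, root_pid=100)
```

Keeping the handlers in a table keyed by (system call, phase) mirrors the "register the set of system calls to intercept" step: calls without an entry pass through untouched.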


Conceptually, Ufo implements the system calls inter- 
cepted by the Catcher, but in practice Ufo does not ser- 
vice them directly. Although implementing the system 
calls directly in Ufo would be possible, it would require 
reimplementing existing OS functionality. Instead, we 
“patch” the system calls by (i) modifying the call's
parameters, (ii) changing the file system state (e.g.,
fetching a file from a remote server) and (iii) modifying the
result returned from the operating system. A good exam- 
ple for the first two actions is the open system call. On 
the entry of an open call, we may have to modify the file 
name string to point to the locally cached copy. Before 
allowing the system call to continue, the Catcher may 
have to wait for Ufo to download the file from the remote
site. Implementing the name change is somewhat com- 
plicated since we must modify the user's address space. 
We cannot just change the filename in place since the 
new filename might be longer than the old one. Also, 
the filename could be in a segment which is read-only or 
shared among threads. Currently we solve these prob- 
lems by writing the new file name in the unused portion 
of the application's stack and changing the system call 
argument to point to the new string. 
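
The filename patch described above can be illustrated with a toy address space. This is only a sketch: the subject's memory is modeled as a bytearray, the stack is assumed to grow toward lower addresses (so the bytes below the stack pointer are unused), and read_cstring and patch_path_argument are hypothetical names.

```python
def read_cstring(memory: bytearray, address: int) -> bytes:
    """Return the NUL-terminated byte string starting at address."""
    end = memory.index(0, address)
    return bytes(memory[address:end])


def patch_path_argument(memory: bytearray, stack_pointer: int,
                        new_path: bytes) -> int:
    """Write new_path just below the stack pointer and return its address.

    Assumes a downward-growing stack, so everything below stack_pointer
    is the unused portion of the stack.
    """
    data = new_path + b"\0"
    address = stack_pointer - len(data)
    if address < 0:
        raise MemoryError("not enough unused stack for the new path")
    memory[address:address + len(data)] = data
    return address


# Toy address space: the original path lives at address 200,
# possibly in a read-only segment, so it is left untouched.
memory = bytearray(256)
original = b"/ftp/host/f.txt"
memory[200:200 + len(original) + 1] = original + b"\0"
syscall_args = [200]                 # argument 0 of open points at the path
stack_pointer = 128

cached = b"/home/user/.ufo-cache/ftp/host/f.txt"   # longer than the original
syscall_args[0] = patch_path_argument(memory, stack_pointer, cached)
```

Only the pointer argument changes, which is why the longer replacement name and a read-only original both stop being a problem.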

The open system call needs to be intercepted on exit 
from the kernel as well. Although the returned result 
is not modified, Ufo must remember the correspondence 
between the returned file handle and the file name, which
is needed when the file is closed. 

Besides file-related system calls, there are several
others that must be intercepted. For example, to track child
processes we intercept the fork system call. Given the 
child pid, we can open its associated /proc file and mon- 
itor it as well. System V allows the set of trapped system 
calls to be automatically inherited from parent to child, 
so this setup is only needed for the initial process. 


3.3 User-Level Restrictions


Implementations of file system functionality at the 
user level must obey some restrictions since user-level 
processes cannot perform arbitrary actions and cannot 
access the whole file system related state of the kernel. 

One problem is that the Catcher cannot control setuid
processes since the security policy of the operating system
disallows user-level processes from attaching to other 
users’ processes. In practice we have found this not to 
be a problem since very few programs are installed with 
setuid. And for most of those programs, e.g., rlogin, it 
is not clear whether one really needs the file system ex- 
tensions. In the current implementation, whenever the 
Catcher detects that a subject process is about to spawn 
a setuid program, it just does not trace the child process. 

In Solaris, the /proc interface allows the controlling 
process to write in the subject process' address space. 
This is important if the Catcher needs to change some 
system call arguments such as filename strings for Ufo. 
If we are to port the Catcher to other operating systems 
that do not provide the capability of writing into the
subject process, we would not be able to implement this
feature. Although writing in the user process is not
necessary for the basic functionality of Ufo, a Ufo
implementation on such an operating system would have some
limitations. Features such as the URL naming scheme 
and mountpoints in the root directory require changing 
of string arguments to system calls and therefore would 
not be possible to implement (see Subsection 4.1). 


Another problem arises when the Catcher process is 
killed. Being a regular user-level process, the Catcher 
cannot protect itself against the SIGKILL signal. There 
is no graceful way to handle such a situation if the
subject processes running under the Catcher continue
working on remote files. In the current Ufo implementation
the subject process will be trapped on the next inter- 
cepted system call and stay trapped until killed. 


3.4 Catcher Discussion 


The Catcher mechanism allows us to create a person- 
alized operating system. Requests made to the kernel 
can be re-interpreted, in effect allowing individual users 
to run their own OS. Any user can use the “new” OS 
without having to modify the original operating system 
or needing root access. Although the current Catcher
only intercepts system calls, System V allows the user 
to also intercept and act on signals and hardware faults. 
This allows for a wide range of OS functionality to be 
extended using the Catcher mechanism. Other potential 
uses of the Catcher for personalized OS extensions in- 
clude encrypting file systems, file systems which store 
files in compressed form, confined execution
environments for running untrusted binaries [GWTB96], virtual
memory paging [DWAP94, FMP+95], and process
migration [Con95].


A potential concern with our approach is its perfor- 
mance overhead. Indeed, intercepting individual system 
calls is quite expensive and for some OS extensions this 
overhead would be unacceptable. Nevertheless, Ufo is 
an example that there are OS extensions for which the 
Catcher mechanism works well. Our performance anal- 
ysis shows that Ufo introduces moderate overhead for 
common applications. This is due to the fact that typical
applications issue relatively few system calls, and not all 
system calls are intercepted in Ufo. 


4 Ufo's Global File System Module 


Ufo provides read and write access to FTP servers
and read-only access to HTTP servers. The remote file 
access functionality is implemented in Ufo's file system 
module which is responsible for resolving remote file 
names, transferring files, and caching.


4.1 Naming Strategies 


Ufo supports three ways of specifying names of re- 
mote files: (i) through a URL, (ii) through a regular file- 
name implicitly containing the remote host, user name, 
and access mode, and (iii) through mount points. 

The first way to specify a remote file is through its 
URL syntax. Unfortunately, some applications cannot 
handle URL names. Make and gmake cannot handle the 
colon in the URL, while Emacs considers // to be the 
root of the file system and thus discards everything to 
the left. 

To alleviate these problems we also support spec- 
ifying a remote file through a regular filename. The 
general syntax is /protocol/user@host/filename 
where protocol is the file transfer protocol, e.g., ftp
or http.
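
The regular-filename syntax can be split into its parts with a few lines of code. The sketch below is an illustration, not Ufo's parser: RemoteName and parse_remote_name are hypothetical names, and treating a missing user@ part as "no user given" is an assumption.

```python
from typing import NamedTuple, Optional

PROTOCOLS = {"ftp", "http"}   # the two protocols named in the text


class RemoteName(NamedTuple):
    protocol: str
    user: Optional[str]       # None when no "user@" part is present (assumption)
    host: str
    filename: str


def parse_remote_name(path: str) -> Optional[RemoteName]:
    """Parse /protocol/user@host/filename; return None for other paths."""
    parts = path.split("/", 3)          # ["", protocol, user@host, rest]
    if len(parts) < 3 or parts[0] != "" or parts[1] not in PROTOCOLS:
        return None
    user, _, host = parts[2].rpartition("@")
    if not host:
        return None
    rest = parts[3] if len(parts) > 3 else ""
    return RemoteName(parts[1], user or None, host, "/" + rest)
```

Because the name contains neither a colon nor a double slash, tools such as make and Emacs see it as an ordinary absolute path.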

Lastly, Ufo allows the user to specify explicit mount- 
points for remote servers or access protocols in a .uforc 
file. For example, the line 


local /csftp remote / 
machine ftp.cs.ucsb.edu method FTP 


specifies that accesses relative to /csftp refer to the 
root directory of the ftp.cs.ucsb.edu anonymous 
FTP server. The user can also specify mountpoints
for access methods. In fact that is how the second 
naming scheme is implemented: if the user does not 
explicitly specify a mount point for the HTTP method, 
for example, Ufo uses the implicit mountpoint: 


local /http method HTTP 


Similarly to Sprite [NWO88], we have implemented
mount points using a prefix table which, given a file- 
name, searches for the longest matching prefix in the list 
of mount points. 
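
A prefix-table lookup of this kind can be sketched as follows. The function name longest_prefix_match and the table layout (mount point mapped to a label for the remote target) are assumptions made for illustration; the second mount entry is a made-up example.

```python
from typing import Dict, Optional, Tuple


def longest_prefix_match(path: str,
                         mounts: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (mount target, path below the mount point), or None."""
    best: Optional[str] = None
    for prefix in mounts:
        # Match whole path components only, so /csftp does not match /csftpx.
        inside = path == prefix or path.startswith(prefix.rstrip("/") + "/")
        if inside and (best is None or len(prefix) > len(best)):
            best = prefix
    if best is None:
        return None
    remainder = path[len(best.rstrip("/")):] or "/"
    return mounts[best], remainder


mounts = {
    "/csftp": "ftp.cs.ucsb.edu",
    "/csftp/pub/mirror": "mirror.example.org",   # hypothetical nested mount
    "/http": "HTTP",                             # implicit method mountpoint
}
```

A linear scan is enough for the handful of entries a .uforc file is likely to hold; a trie would only pay off for much larger tables.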

Ufo also supports symbolic links. A user can create 
links to frequently accessed remote directories. While 
links simplify accesses to remote files, they actually 
present quite an implementation challenge, since they 
require following all link components to determine the 
true name of a file. 


4.2 Accessing Remote Files and Directories 


Ufo transfers only whole files to and from the remote 
file system. Whenever Ufo intercepts the open system 
call for a remote file, it ensures that a local copy of the 
file exists in the cache, and then redirects the system call 
to the local copy. Read and write system calls don’t even
have to be intercepted since they operate on file descrip- 
tors returned by the open; they will correctly access the 
local copy in the cache. Finally, on a close system call, 
Ufo checks whether the file has been modified and if so, 
stores the file back to the server (the store may be
delayed if write-back caching is in effect). Ufo uses whole
file transfers for two reasons: this minimizes the number 
of system calls that need to be intercepted, and protocols 
such as FTP only support whole file transfers. 

When an application requests information about a
remote file, e.g., through a stat or lstat system call, Ufo
satisfies the request by creating a local file stub and
redirecting the system call to it. The file stub has the correct
modification date and size of the remote file but
contains no actual data (see footnote 3). With this approach Ufo
neither has to re-implement the stat system call, nor download
the whole file. Only if the application later opens a
file stub will Ufo actually download the remote file.
Similarly, when a system call such as getdents (get directory
entries) is issued on a remote directory, Ufo creates
a copy of the directory in the local cache and puts file
stubs in it. Then, it redirects the system call to the
skeleton directory thus created.
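
Stub creation by seeking and writing a single byte can be sketched with standard file operations. The function name create_stub and its parameters are hypothetical; whether the hole actually saves disk space depends on the local file system.

```python
import os


def create_stub(path: str, size: int, mtime: float) -> None:
    """Create a file that reports the given size and modification time."""
    with open(path, "wb") as stub:
        if size > 0:
            stub.seek(size - 1)    # leave a hole before the last byte
            stub.write(b"\0")      # one written byte fixes the file size
    os.utime(path, (mtime, mtime))  # (access time, modification time)
```

A later stat on the stub then returns the remote file's size and date without any data having been transferred.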


4.3 Caching and Cache Consistency


Since remote data transfers can be quite slow, Ufo
implements caching of remote files to achieve reasonable
performance. Instead of downloading a file each time the
user opens it for reading, Ufo keeps local copies of
previously accessed files. Ufo can reuse the local copy on a
subsequent access, as long as it is up-to-date. Similarly,
we use write-back caching which delays writing a
modified file back to the remote server. While files are the
primary objects cached, Ufo also caches directory
information (directory contents), and file information (size,
modification time, permissions). The FTP module
additionally caches open control connections. Since
establishing a new connection to the remote server for each
transfer is expensive, we reuse open control connections
by keeping them alive for a period of time after a transfer
has completed.

The cache consistency policy governs whether we are
allowed to use a local copy on a read, and whether we
can delay the write-back of a modified file. To efficiently
support a wide range of usage patterns, Ufo provides an
adjustable consistency policy based on timeouts (a read
and write delay). The policy guarantees that (i) when
a file is opened it is no more than T_read seconds out
of date; and (ii) changes made to a file will be written
back to the server within T_write seconds after the file is
closed. To verify that a local file is up to date (i.e., is
not stale), Ufo checks whether the file on the remote site
has changed (validate on open). T_read and T_write can
have a zero value. In this case files opened for reading
are never stale and modified files are written back to the
server immediately after they are closed.


Footnote 3: Creating a stub is done by seeking to the desired position in a
newly created file and then writing a single byte. On most file systems,
the resulting stub occupies a small amount of disk space, independent
of the reported file size.

The write timeout of a file is always a certain number 
of seconds. The read timeout can optionally be specified 
as a percentage of the file's age as in Alex [Cat92]. This 
method is based on the observation that older files are 
less likely to change than newer files. Therefore older 
files need to be validated less often. Files can have in- 
dividual timeouts and Ufo provides mechanisms for the 
user to define default timeouts for all files, or for all files 
on a server. This allows the user to adjust the tradeoff 
between performance and consistency based on known 
usage patterns. For example, when mounting read-only 
binaries large read timeouts can be used since these files 
change rarely. 
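
The timeout rules above can be written down as three small functions. This is a sketch under assumptions: the function names are hypothetical, a file's age is taken to be the time since its last modification, and a copy is revalidated once the read timeout has fully elapsed.

```python
from typing import Optional


def read_timeout(file_age: float, fixed: Optional[float] = None,
                 percent_of_age: Optional[float] = None) -> float:
    """Read timeout in seconds: a fixed value or a percentage of the file's age."""
    if percent_of_age is not None:
        return file_age * percent_of_age / 100.0
    return fixed if fixed is not None else 0.0


def needs_validation(now: float, last_validated: float, t_read: float) -> bool:
    """True if the cached copy must be checked against the server on open."""
    return now - last_validated >= t_read


def write_back_deadline(closed_at: float, t_write: float) -> float:
    """Latest time by which a modified file must reach the server."""
    return closed_at + t_write
```

For example, a file last modified a day ago with a 10% read timeout is revalidated at most every 8640 seconds, while a read timeout of zero forces a check on every open.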


4.4 Authentication and Security 


Ufo relies on the underlying access protocols for au- 
thentication. Currently, passwords are only required for 
authenticated FTP servers and are not needed for HTTP 
and anonymous FTP accesses. Ufo allows the passwords 
to be stored in the .uforc or .netrc files, or alternatively,
Ufo asks for the password on the first access to a remote 
server. 

Since Ufo is running entirely at the user level with 
the access permissions of its owner, it does not introduce 
new security problems in the system. The only potential 
security concern is to ensure that other users do not gain 
undesired access to the files in the private Ufo cache. 
We avoid this problem by creating the topmost cache 
directory with read and write permissions for the owner 
only. 


4.5 Implementation Trade-Offs 


In implementing Ufo, we tried to minimize the 
amount of operating system functionality that we had 
to reimplement. First, we attempted to minimize the 
number of intercepted system calls in order to minimize 
the execution overhead that Ufo introduces. This led
to the whole file caching policy. Second, we wanted 
to minimize the implementation effort by modify- 
ing/reimplementing as few system calls as possible. 
This led to our decision to create file stubs and skeleton
directories for the stat and getdents calls. 

Of course, there is a trade-off between execution 
overhead and implementation effort. For example, the 
advantage of creating file stubs and skeleton directories 
is that we do not have to reimplement the stat and 
getdents system calls. The disadvantage is that
creating file stubs may have high overhead. Also for
efficiency, we rely on the support for holey files by 
the local file system. For example, on our machines 
the /tmp file system does not support holey files, thus 
if we use /tmp for the Ufo cache the stubs for large
files do use all the disk space indicated by their size. 
The NFS-mounted file systems at our site do support 
holey files, but the stub creation there is an order of 
magnitude slower than on /tmp. For these reasons we 
are considering implementing the stat and getdents 
system calls completely inside the next version of Ufo 
to improve its performance. In fact, we are already par- 
tially implementing (patching) the getdents system call 
in order to support Ufo mountpoints in user-unwritable 
areas such as the root directory. 

Transferring only whole files introduces three well 
known problems for extremely large files [Cat92]. First, 
when only a small fraction of a file is actually accessed, 
a lot of unnecessary data may be transferred. Second, 
the whole file has to fit on the local disk. In practice 
we don't expect these two problems to occur frequently. 
With the exception of databases, most applications tend 
to access files nearly in their entirety [BHK+91].
Furthermore, Ufo allows any local file system to be used
for file transfers, thus reducing the danger of insufficient 
local disk space. A third problem comes from our de- 
cision not to intercept the read and write system calls. 
In our approach the open call blocks until the whole file 
has been transferred. It is possible to intercept and han- 
dle read and write system calls in Ufo. The benefit is 
that open would not always block (see footnote 4): reads that operate
on the already present part of a file could be executed 
without waiting for the completion of the whole transfer 
(see Alex [Cat92]). The drawback is that intercepting
read and write calls incurs a high overhead and requires
extra implementation effort.


5 Performance Measurements 


The main goal of our performance analysis is to mea- 
sure the overhead introduced by the Catcher mechanism 
in Ufo. This information is necessary to determine the 
usability of our method for operating system extension. 

We first present the results of several microbench- 
marks, which measure the overhead of intercepting in- 
dividual Unix system calls. To demonstrate the overall 
impact of this overhead on whole applications we also 
present measurements for a set of file system bench- 
marks and a set of real-life applications. While the mi- 
crobenchmarks show that intercepting system calls is ex- 
pensive, the real-life applications exhibit much lower 
overhead. 

All tests were run on a 143 MHz Sun Ultra 1 workstation
with 64 megabytes of main memory running Solaris
2.5.1.


Footnote 4: Several other system calls like lseek also have to be intercepted
for this to work.


5.1 Microbenchmarks 


The microbenchmark results present the user-perceived
run times (measured as wall clock times)
for open, close, stat, read, write, and getpid system 
calls. The results are shown in Table 3. The columns 
show the numbers for the normal user program, for the 
Catcher-monitored program (Catcher only, no calls to 
Ufo functions), and for the Ufo program (Catcher and 
Ufo functionality). In the latter case, we examine the 
run times for a local file, for a cached remote file and 
for a remote file that has not been cached. 

The Catcher only and Ufo local file numbers are of 
special significance. They show the cost of running a 
process under the Catcher or under Ufo when the pro- 
cess accesses local files only and does not require any of 
the extended OS functionality. This is the fundamental
overhead introduced by our method of extending the
OS. The numbers for remote files are a measure of the 
combined effect of our remote file system implementa- 
tion, our caching policy, the efficiency of the underlying 
access protocol (FTP in this case), and the quality of the
network connection. 

In order to measure the cost of the Solaris system 
calls themselves and not the network speed or the 
NFS overhead, we used the local /tmp file system. 
Accesses to /tmp are very fast and do not involve disk, 
network traffic or protocol overhead. As a result the 
microbenchmarks present the Catcher and Ufo overhead 
in the worst-case scenario. The relative Catcher and 
Ufo overhead for accessing non-cached NFS files, for 
example, is much lower. 

The microbenchmarks were run on a lightly loaded 
workstation by taking the wall-clock time just before and 
just after the system call. The timing was done using the 
high resolution timer gethrtime which has a resolution 
of about 0.5 microseconds on the Ultra 1 workstation. 
Since individual system calls are very fast, normal sys- 
tem activity such as interrupts and context switches dis- 
torts some of the measurements. This produces a small 
percentage of outliers that are several times larger than 
the rest of the measurements. To ensure we do not in- 
clude unrelated system activity in our measurements, in 
each test run we recorded 100 measurements and dis- 
carded the highest 10% of them. The remaining times 
were then averaged. The numbers in the table are the 
arithmetic mean of five such runs. The standard devia- 
tion for the five runs was below 2% for all tests, except 
for getpid, for which the standard deviation was at most 
6%. 
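
The outlier handling described above amounts to a one-sided trimmed mean per run, followed by a mean over runs. The sketch below uses hypothetical function names, and the choice of the population standard deviation for the relative spread is an assumption, since the text does not say which estimator was used.

```python
import statistics
from typing import List, Sequence, Tuple


def trimmed_mean(samples: Sequence[float], discard_percent: int = 10) -> float:
    """Average the samples after dropping the highest discard_percent of them."""
    n_discard = len(samples) * discard_percent // 100
    kept = sorted(samples)[:len(samples) - n_discard]
    return statistics.mean(kept)


def summarize(runs: List[Sequence[float]]) -> Tuple[float, float]:
    """Return (mean over runs, relative standard deviation in percent)."""
    per_run = [trimmed_mean(run) for run in runs]
    mean = statistics.mean(per_run)
    spread = statistics.pstdev(per_run) / mean * 100.0
    return mean, spread
```

Dropping only the highest values is appropriate here because interrupts and context switches can only lengthen a measurement, never shorten it.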

The Catcher only numbers show the cost of intercept- 
ing system calls. The results are obtained by running 
the benchmark program under the control of the Catcher 
alone. The Catcher simply intercepts the open, close and 
stat system calls executed by the benchmark program,
and lets them continue immediately without modifying
them. The read, write and getpid system calls are not
intercepted at all. Even though one may expect that these
system calls will not be affected, they do incur a small
overhead: whenever there is even a single intercepted
system call for a process, the operating system takes a
different execution path for all system calls of that
process, independent of whether they are intercepted or not.
The results demonstrate that for read and write of 1 byte
blocks this overhead is small and for 8K blocks it is
negligible. Because getpid is so fast, it has a substantial
relative overhead, but still only 2 µs total. On the other hand,
system calls that must be trapped by the Catcher incur a
factor of 4-9 overhead. During this extra time, control
is passed from the program to the Catcher (which
performs ioctl calls to read information from the /proc file
system), and then back again.


System Call   Standard OS   Catcher only   Ufo local file   Ufo remote cached   Ufo remote no cache
open                                                                            531 ms (18964)
close                                                                           452 ms (37667)
stat                                                                            198 ms (6000)
getpid                                                                          5 µs (1.67)
write 1b                                                                        26 µs (1.13)
read 1b                                                                         28 µs (1.12)
write 8K                                                                        101 µs (1.04)
read 8K                                                                         79 µs (1.05)


Table 3: Run times in microseconds for various system calls for accessing files in /tmp (the numbers are the arithmetic
mean of 5 runs, each executing 100 iterations). The numbers in parentheses represent the ratio normalized to the
standard Solaris OS.


The Ufo local file column shows how much extra 
overhead is introduced by Ufo in addition to the Catcher. 
The benchmark program is running under Ufo and is 
accessing local files only. Even though no remote files 
are accessed, Ufo still introduces some overhead in 
addition to the Catcher overhead. The extra overhead 
comes from the analysis of the parameters of the 
intercepted system calls. For system calls that reference 
a file, Ufo determines whether the file is indeed local 
or remote. Since a system call does not necessarily 
take an absolute path name as an argument, Ufo has 
the responsibility of determining it. Determining the 
true filename can involve a number of stat system calls,
similar in flavor to the pwd command, and this can add 
a noticeable overhead. 


The remaining two columns measure the overhead of 
Ufo when working with remote files. These numbers are 
measured as with Ufo local file, except that the accesses 
are to remote files. For the Ufo remote cached tests, a 
locally cached copy of the remote file is accessed. Note 
that in either case (cached or uncached), the read and 
write system calls operate on the locally cached copy of
the file. Thus, these numbers are consistent across all 
of the tests. On the other hand, open and stat calls to 
uncached remote files require remote accesses, and the 
overhead increases dramatically when Ufo uses the FTP 
protocol to retrieve the file. This overhead is almost en- 
tirely determined by the quality of the network connec- 
tion and the FTP protocol. In our measurements we ac- 
cessed files located at UC Berkeley. From a UC Santa 
Barbara machine, opening a remote file of size 1024 
bytes residing at a UC Berkeley host requires 531 ms
using FTP. Closing the same remote file after modifying
it takes 452 ms since the file must be written back to the
remote server. If the file is cached, the open, close and
stat overhead is much smaller, but it still has roughly
four times the overhead compared to a local file. This is 
due to two reasons: the additional work to manage the 
cache, and several remaining inefficiencies in our pro- 
totype implementation which will be corrected in future 
versions of Ufo. 


5.2 File System Benchmarks 


Table 4 reports the absolute execution times in 
seconds for two file system benchmarks run on the 
local /tmp file system with and without Ufo and on 
a remote FTP-mounted file system with and without
caching. For these tests, the FTP host was a machine 
on the local 100Mbit/s Ethernet network. The remote 
tests with caching were with a warm cache and read and 
write delays set to infinity. Thus, these measurements 
represent the best-case scenario for remote files. For 
the remote tests without caching, the read and write 
delays were set to zero, forcing every open, close and
stat system call to go to the remote site. These tests are
the worst-case scenario for accessing remote files under 
Ufo. 

Iostone and Andrew are standard file system
benchmarks. We chose these as examples of applications that
execute a lot of file system calls that Ufo intercepts 
and handles. The Iostone benchmark [IOS87] performs
thousands of file accesses (opening, reading, and
writing). Because of the large number of file opens and
closes, Iostone runs about 8 times slower under Ufo on the
local file system. The Andrew benchmark [HKM+88] measures
five stages in the generation of a software tree. The
stages (i) create the directory tree, (ii) copy source code
into the tree, (iii) scan all the files in the tree, (iv) read
all of the files, and finally (v) compile the source code
into a number of libraries. For this benchmark the Ufo
overhead on local files is a factor of 1.33, much lower
than the overhead for Iostone. For both Andrew and
Iostone, the results for the uncached remote tests are
orders of magnitude worse than for the local /tmp file
system. This is not surprising since the network
latency and the FTP protocol overhead are quite large
compared to the fast accesses in /tmp.


Columns: System calls total; System calls intercepted; Standard OS; Ufo local file; Ufo remote cached; Ufo remote no cache.
Rows: iostone; andrew (makedir, copy, scandir, read all, make, total).


Table 4: Run times for the Iostone and Andrew file system benchmark programs with and without Ufo. Times are
in seconds, with the ratios normalized to the standard OS shown in parentheses. (The Andrew benchmark reports
its timing results with a resolution of 1 second. The 0 seconds in the table indicate a measurement between 0 and 1
second.)


5.3 Application Programs


We also tested Ufo with a number of larger Unix
applications: latex, ghostscript, a make of the Ufo
executable, and the integer applications from the SPEC95
benchmark suite. The results are shown in Table 5. As 
with the file system benchmarks, each test was run with- 
out Ufo, under Ufo on local files only, and under Ufo on 
remote files with and without caching. 

The first set of benchmarks are programs that we run
frequently. The latex test measures the time to run latex three
times on a 20-page paper consisting of 8 tex files and then
produce a PostScript file from the dvi file. The make test
compiles Ufo itself using g++. The ghostscript test
displays a 20-page PostScript document. The table shows
that latex and make perform a relatively large number of
system calls that Ufo intercepts, mainly open, close, and
stat. This results in Ufo overheads of 24% and 22%
respectively when run locally, and higher overheads when
run remotely. The remote overheads, while large, should
be acceptable to the user, since accessing remote files is
expected to cost extra time. The local overheads, on the
other hand, are incurred only because the application is
running under Ufo even though it is not using any of its
functionality. To avoid unnecessary local overhead,
applications that only access local files can be run without
Ufo, and Ufo can be detached from applications once
they stop accessing remote files.

The ghostscript test on the other hand performs few 
calls that Ufo intercepts and never writes to the remote 
server; as a result the Ufo overhead is very low even in 
the remote test. This sort of overhead should be unno- 
ticeable to the user. 

The last eight tests are the integer applications from
the SPEC95 benchmark suite. These were chosen as
examples of compute intensive applications that do not
perform extensive file system operations. For these
applications the observed overhead is very small in the local
and even in the remote tests. Small perceived overheads
should also be expected for interactive applications such
as text editors since the user is not likely to notice the
difference between 28 µs and 611 µs when opening a
local file.


5.4 Summary of Experimental Results 


As expected, we find that intercepting system calls 
can be very expensive, and remote accesses are orders 
of magnitude higher than local accesses. For programs 
such as the Jostone benchmark — which performs many 
open, close and stat calls — the Ufo overhead for local 
files is too large to be ignored. Clearly, such programs 
should not be run under Ufo if they only access local 
files since this will incur a large overhead even though 
the program does not utilize any of the extended func- 
tionality. If remote files need to be accessed, then pro- 
grams like Iostone will run slow, but this is mainly due to 
the network latency and access protocol overhead which 
Application     System calls   System calls   Standard OS   Ufo            Ufo             Ufo
                total          intercepted                  local file     remote cached   remote no cache

latex                                                       16.1s (1.24)   17.6 (1.35)
make                                                        36.5s (1.22)   38.5 (1.24)
ghostscript                                                 4.3s (1.05)    4.4 (1.07)
099.go
124.m88ksim
126.gcc
129.compress
130.li
132.ijpeg
134.perl
147.vortex

Table 5: Relative run times for some file system benchmarks and larger Unix applications. Times are in seconds, and 
the relative speed in parentheses. The first column shows the number of system calls executed by the application. 


by far outweighs the Catcher and Ufo overhead as shown 
in Table 3. In this case Ufo proves to be a convenient 
tool. Furthermore, Ufo allows the user to dynamically 
attach to a running process and detach from it, so the 
choice of running under Ufo or not is always available. 
Applications that need access to remote files can have it, 
and the remaining processes will not incur any overhead. 


Other applications, such as make and latex, incur a
22-24% overhead on local files, which is noticeable but per-
haps acceptable to the user even when the functionality
of Ufo is not required. For remote files these applica- 
tions incur overheads of 24% in the best case and 600% 
in the worst case depending on the kind of file caching 
used. In most cases the user expects that working on 
remote files would be slower, so the use of the extra 
functionality provided by Ufo should be worth the ad- 
ditional overhead, especially when the only alternative 
is to manually transfer files using FTP. Many other ap- 
plications, such as compute intensive programs or text 
editors, make infrequent use of the system calls trapped 
by Ufo (though they may use other calls such as read 
and write). For such applications, user-perceived delays 
are much smaller: on the order of a few percent. In this 
case, running applications under Ufo makes no appre- 
ciable difference. 


From these observations we can draw the conclusion 
that the Catcher is a good tool for implementing operat- 
ing system extensions that require the interception only 
of relatively infrequent system calls. An example of 
such an extension is the Ufo file system when running 
real-life applications. On the other hand, this method is 
not ideal for extensions that intercept frequently oc-
curring system calls.


Finally, we would like to mention that we are aware 
of some opportunities for improving the current Catcher 
prototype and many opportunities for improving Ufo. 
For example, optimizing the filename check in Ufo (to 
determine whether a file is local or remote) alone will
result in a significant reduction of the running time. 


6 Conclusions and Future Work 


In this paper, we presented a general way of extend- 
ing operating systems functionality, using the debugging 
and tracing facilities provided by many Unix operating 
systems. Selected system calls are intercepted at the 
user-level and augmented to obtain the desired function- 
ality. This mechanism forms the basis for Ufo, a file sys- 
tem providing transparent access to remote files on FTP 
and HTTP servers. Ufo proved to be a useful tool which 
we now use daily. As our experimental results show, its 
overhead, while quite large for intercepted system calls, 
is acceptable for most applications. 

We believe that our approach is a promising way for
individual users to develop and experiment with future
operating system extensions, since this can be done com-
pletely at the user level. Essentially, each user sees a
personalized version of the operating system. Extensions
do not affect other users and are compatible with ex-
isting applications, as those need not be re-compiled or
re-linked. In the past, operating systems research has had
difficulty carrying over to the general public. With our
approach, researchers can make their extensions easily
available, and users can run them without relying on the
system administrator for installation.

There are plenty of avenues for future work and re- 
search. For example, we have several ideas on how to 
improve the performance of the Catcher and Ufo. We 
also plan to implement new protocol modules in Ufo, 
e.g. based on NFS, WebNFS, and the rlogin protocols. 
We have experimented with several other OS extensions
suitable for cluster-of-workstations environments. For ex-
ample, we have developed a prototype that attaches to a
process, checkpoints it, and then can restart it at a later
time or migrate it to another processor. Similarly, we
have a prototype Catcher which intercepts all forks and
execs and sometimes decides to execute some processes
on other workstations. While both tools are still at a very
crude stage, we have already seen some of their poten-
tial benefits. Similar benefits can be expected for paging
virtual memory to the memory of idle processors instead
of to a slow local disk.

Another interesting research area is protected com- 
puting. The system calls define the capabilities a process 
has and resources it can obtain (memory, disk access, 
CPU time). We can use the Catcher to limit the resources 
a process can access or obtain. This approach, imple- 
mented in Janus [GWTB96], is especially interesting in 
the current development of global computing, where one 
user may run an untrusted binary fetched from the Inter- 
net. 

Finally, we intend to generalize our design of the 
Catcher since it can not only intercept system calls, but 
also signals and hardware traps which are delivered to 
the application. We intend to build a Catcher toolbox 
which can be used for OS courses and research projects. 
Abstract


In this paper, we investigate network-aware mobile pro- 
grams, programs that can use mobility as a tool to adapt 
to variations in network characteristics. We present infras- 
tructural support for mobility and network monitoring and 
show how adaptalk, a Java-based mobile Internet chat 
application, can take advantage of this support to dynami- 
cally place the chat server so as to minimize response time. 
We conclude that on-line network monitoring and
adaptive placement of shared data-structures can signifi- 
cantly improve performance of distributed applications on 
the Internet. 


1 Introduction 


Mobile programs can move an active thread of control 
from one site to another during execution. This flexibility 
has many potential advantages. For example, a program 
that searches distributed data repositories can improve its 
performance by migrating to the repositories and perform-
ing the search on-site instead of fetching all the data to its
current location. Similarly, an Internet video-conferencing 
application can minimize overall response time by position- 
ing its server based on the location of its users. Applications 
running on mobile platforms can react to a drop in network 
bandwidth by moving network-intensive computations to a 
proxy host on the static network. The primary advantage 
of mobility in these scenarios is that it can be used as a 
tool to adapt to variations in the operating environment. 
Applications can use online information about their oper- 
ating environment and knowledge of their own resource 
requirements to make judicious decisions about placement 
of computation and data. 

For different applications, different resource constraints 
are likely to govern the decision to migrate, e.g. net- 
work latency, network bandwidth, memory availability, 


*This research was supported by ARPA under contract #F19628-94-
C-0057, Syracuse subcontract #353-1427


server availability. In this paper, we investigate network- 
aware mobile programs, i.e. programs that position them- 
selves based on their knowledge of network characteristics. 
Whether the potential performance benefits of network-
aware mobility are realized in practice depends on the answers
to three questions. First, how should programs be struc- 
tured to utilize mobility to adapt to variations in network 
characteristics? In particular, what policies are suitable 
for making mobility decisions? Second, is the variation in 
network characteristics such that adapting to them can be 
profitable? Finally, can adequate network information be 
provided to mobile applications at an acceptable cost? 

In order to adapt to network variations, mobile programs 
must be able to decide when to move, what to move and 
where to move. There are three types of network variations 
which may be cause for migration: (1) population varia- 
tions, which represent changes in the distribution of users 
on the network, as sites join or leave an ongoing distributed 
computation; (2) spatial variations, i.e. stable differences 
in the quality of different links, which are primar-
ily due to the hosts’ connectivity to the Internet; and (3)
temporal variations, i.e. changes in the quality of a link 
over a period of time, caused presumably by changes in 
cross-traffic patterns and end-point load. Spatial variations 
can be handled by a one-time placement based on the in- 
formation available at the beginning of a run. Adapting to 
temporal and population variations requires dynamic place- 
ment which needs a periodic cost-benefit analysis of current 
and alternative placements of computation and data. Dy- 
namic placement decisions have two partially conflicting 
goals: maximize the performance improvement from mo- 
bility and minimize the cost of mobility. If an opportunity 
for improving performance presents itself, it should be cap- 
italized upon; however, reacting too rapidly to changes in 
the network characteristics can lead to performance degra-
dation as the performance gain may not offset the mobility
cost. 


We investigate these issues in the context of Sumatra, 
an extension of the Java¹ programming environment [10]


¹ Java is a registered trademark of Sun Microsystems.
that provides a flexible substrate for adaptive mobile pro-
grams. Since mobile programs are scarce, we developed
a mobile chat server for our experiments. This applica- 
tion, called adaptalk, monitors the latencies between all 
participants and locates the chat server so as to minimize 
the maximum response time. We selected this application 
since it is highly interactive and requires fine-grain commu-
nication. If such an application is able to take advantage of
information about network characteristics, we expect that 
many other distributed applications over the Internet would 
be similarly successful. The resource that governs the mi- 
gration decisions of adaptalk is network latency. To 
provide latency information, we have developed Komodo, 
a distributed network latency monitor. 

To evaluate if mobile applications can take advantage 
of network-awareness, we examined the performance of 
adaptalk with and without mobility. Our evaluation had
two main goals: (1) to determine the performance ben- 
efits, if any, of network-aware placement of the central 
chat server over a network-oblivious placement; and (2) to 
determine if dynamic placement based on online network 
monitoring provides significant performance gains over a 
one-time placement based on initial information. Our re-
sults are encouraging: they indicate that on-line monitoring
and dynamic placement can significantly improve perfor- 
mance of distributed applications on the Internet. 

The paper is organized as follows. Section 2 describes 
Sumatra and the programming model that it provides. Sec-
tion 3 describes the design and implementation of Komodo.
Section 4 describes the adaptalk application and the 
policy it uses to make mobility decisions. Section 5 de- 
scribes our experiments and presents the results. Section 6 
discusses the results and their implications. Section 7 de- 
scribes related work and Section 8 provides our conclusions 
and plans for future work. 


2 Sumatra: a Java that walks 


Sumatra is an extension of the Java programming envi- 
ronment that supports adaptive mobile programs. Platform- 
independence was the primary rationale for choosing Java 
as the base for our effort. In the design of Sumatra, we 
have not altered the Java language. Sumatra can run all 
legal Java programs without modification. All added func- 
tionality was provided by extending the Java class library 
and by modifying the Java interpreter without affecting the 
virtual machine interface. 

Our design philosophy for Sumatra was to provide the 
mechanisms to build adaptive mobile programs. Policy 
decisions concerning when, where and what to move are 
left to the application. The main feature that distinguishes 
Sumatra from previous systems [3, 11, 13, 23] that support 
mobile programs is that all communication and migration
happen under application control. Furthermore, the combi-
nation of distributed objects and thread migration allows
applications the flexibility to dynamically choose between 
moving data or moving computation. The high degree of 
application control allows us to easily explore different pol- 
icy alternatives for resource monitoring and for adapting to 
variations in resources. We believe that the space of design 
choices for adaptive mobile programs is yet to be mapped 
out and such flexibility is important to help explore this 
space. 

Sumatra adds two programming abstractions to Java: 
object-groups and execution engines. An object-group is a 
dynamically created group of objects. Objects can be added 
to or removed from object-groups. All objects within an 
object-group are treated as a unit for mobility-related op- 
erations. This allows the programmer to customize the 
granularity of movement and to amortize the cost of mov- 
ing and tracking individual objects. This is particularly 
important in languages like Java because every data struc- 
ture is an object and moving the state one object at a time 
can be prohibitively expensive. An execution-engine is the 
abstraction of a location in a distributed environment. In 
concrete terms, it corresponds to an interpreter executing 
on a host. Sumatra allows object-groups to be moved be- 
tween execution-engines. An execution-engine may also 
host active threads of control. Currently, multiple threads 
on the same engine are scheduled in a run-to-completion 
manner. We plan to implement other scheduling strategies 
in the future. Threads can move between engines.


The principal new operations provided by Sumatra are: 


Object-group migration: Object-groups can be moved 
between engines on application request. As mentioned 
earlier, all objects within an object-group are treated as a 
unit for mobility-related operations. Objects in an object- 
group are automatically marshalled using type-information 
stored in their class templates. When an object-group is 
moved, all local references to objects in the group (stack 
references and references from other objects) are converted 
into proxy references which record the new location of the 
object. Some objects, such as I/O objects, are tightly bound 
to local resources and cannot be moved. References to 
such objects are reset and must be reinitialized at the new 
site. The class template for an object (and the associated 
bytecode) can be downloaded into an execution-engine on 
application request. 


Remote method invocation: Method invocations on 
proxy objects are translated into calls at the remote site. 
Type information stored in class-templates is used to 
achieve RPC functionality without a stub compiler. Ex- 
ceptions generated at the called site are forwarded to the 
caller. Sumatra does not automatically track mobile ob-
jects. Requesting a remote method invocation on an object
that is no longer at the called site results in an object-moved
exception at the calling site. To facilitate application-level 
tracking, the exception carries with it a forwarding address. 
The caller can handle the exception as it deems fit (e.g., re- 
issue the request to the new location, migrate to the new 
location, raise a further exception and so on). This mech- 
anism allows applications to locate mobile objects lazily, 
paying the cost of tracking only if they need to. It also 
allows applications to abort tracking if need be and pursue 
an alternative course of action. 
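
The lazy tracking described above can be sketched in plain Java. This is a simulation rather than Sumatra's actual API: `ObjectMovedException`, `LazyTracker`, `invokeWithTracking` and the `maxHops` cut-off are hypothetical names introduced here for illustration. The sketch shows a caller that follows forwarding addresses only when a call fails, and that can give up after a bounded number of hops.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the object-moved exception: it carries the
// forwarding address (here, the name of the site the object moved to).
class ObjectMovedException extends Exception {
    final String forwardingAddress;

    ObjectMovedException(String forwardingAddress) {
        super("object moved to " + forwardingAddress);
        this.forwardingAddress = forwardingAddress;
    }
}

// Simulates one mobile object and the forwarding addresses it leaves behind.
class LazyTracker {
    private final Map<String, String> forwarding = new HashMap<>();
    private String currentHome;

    LazyTracker(String home) {
        currentHome = home;
    }

    // Move the object and leave a forwarding address at the old site.
    void move(String newHome) {
        forwarding.put(currentHome, newHome);
        forwarding.remove(newHome); // the new home no longer forwards elsewhere
        currentHome = newHome;
    }

    // A simulated remote call: succeeds only if the object is at that site.
    String call(String site, String request) throws ObjectMovedException {
        if (site.equals(currentHome)) {
            return request + "@" + site;
        }
        String next = forwarding.get(site);
        if (next == null) {
            throw new IllegalStateException("object was never at " + site);
        }
        throw new ObjectMovedException(next);
    }

    // Application-level policy: re-issue the request at the forwarding
    // address, but abort after maxHops so the caller can do something else.
    String invokeWithTracking(String site, String request, int maxHops) {
        String target = site;
        for (int hop = 0; hop <= maxHops; hop++) {
            try {
                return call(target, request);
            } catch (ObjectMovedException e) {
                target = e.forwardingAddress;
            }
        }
        return null; // tracking aborted
    }
}
```

Re-issuing the request is only one of the options the text lists; a caller could equally migrate to the forwarding address or raise a further exception from the `catch` block.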


Thread migration: Sumatra allows explicit thread migra-
tion using an engine.go() function that bundles up the
stack and the program counter and moves the thread to 
the specified execution-engine. Execution is resumed at 
the first instruction after the call to go. To automatically 
marshal the stack, the Sumatra interpreter maintains a type 
stack parallel to the value stack, which keeps track of the 
types of all values on the stack. When a thread migrates, 
Sumatra transports with it all local objects that are refer- 
enced by the stack but do not belong to any object-group. 
Objects that belong to an object-group move only when 
that object-group is moved. Stack references to the objects 
that are left behind (i.e., were part of some object-group) are
converted to proxy references. After the thread is moved 
to the target site, it is possible that its stack contains proxy 
references that point to objects that used to be remote but 
are now local. These references are converted back to local
references before the call to go returns. 
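
To illustrate why a parallel type stack makes automatic marshalling possible, here is a minimal sketch in plain Java. It is not the Sumatra interpreter: the class `TypedStack`, the two-tag `Tag` enum, the fixed capacity of 64 slots and the integer "handle" used to model a reference are all assumptions made for this example. The point is that the raw word 7 on the value stack is ambiguous until the tag at the same depth says whether it is an integer or a reference.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a value stack of raw words plus a parallel type stack.
class TypedStack {
    enum Tag { INT, REF }

    private final long[] values = new long[64]; // raw slots, meaning unknown without a tag
    private final Tag[] types = new Tag[64];    // one tag per slot, kept in lockstep
    private int top = 0;

    void pushInt(int v) {
        values[top] = v;
        types[top] = Tag.INT;
        top++;
    }

    // A reference is modelled as a handle (an index into some object table).
    void pushRef(int handle) {
        values[top] = handle;
        types[top] = Tag.REF;
        top++;
    }

    // Walk both stacks together: integers are copied by value, references are
    // flagged so that migration code could move the object or install a proxy.
    List<String> marshal() {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < top; i++) {
            out.add(types[i] == Tag.INT ? "int:" + values[i] : "ref:#" + values[i]);
        }
        return out;
    }
}
```

Pushing the integer 7 and then a reference with handle 7 yields two distinct marshalled entries, which would be impossible to tell apart from the value stack alone.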


Remote execution: A new thread of control can be created 
by rexec’ing the main method of a class existing on a re-
mote engine. The arguments for the new thread are copied and
moved to the remote site. Unlike remote method invoca- 
tion, remote execution is non-blocking; the calling thread 
resumes immediately after the main method call is sent 
to the remote engine. Remote execution is different from 
thread migration as it creates a new thread at the remote site 
that runs concurrently with the original thread; thread mi- 
gration moves the current thread to the remote site without 
creating a new thread. Concurrent threads communicate 
using calls to shared objects. The thread initiating a remote 
execution can share objects with the new thread by passing 
it references to these objects as arguments to main. 


Resource monitoring: Sumatra provides a resource- 
monitoring interface which can be used by applications 
to register monitoring requests and to determine current 
values of specific resources. This interface is similar to 
an object-oriented version of the Unix ioctl() interface.
When an application makes a monitoring request, Suma- 
tra forwards the request to the local resource monitor. If 


the monitor does not support the requested operation, an 
exception is delivered to the application. 


Signal handlers: Sumatra allows applications to regis- 
ter handlers for a subset of Unix signals. Signals can be 
used by the external environment (the operating system 
or some other administrative process) to inform the ap- 
plication about urgent asynchronous events, in particular 
resource revocation. Using a handler, the application can 
take appropriate action including moving away from the 
current execution site. 


2.1 Example 


In this section, we provide a feel for the Sumatra pro- 
gramming model using a simple example. The task is to 
scan through a database of X-ray images stored at a remote 
site for images that show lung cancer. This task can be 
performed in two steps. In the first step, a computation- 
ally cheap pruning algorithm is used to quickly identify 
lungs that might have cancer. A compute-intensive cancer-
detection algorithm is then used to identify images that
actually show cancer. 

One way to write a program for this task would be to 
download all lung images from the image server and do all
the processing locally. If the absence of cancer in most lung
images can be cheaply established, this scheme wastes net- 
work resources as it moves all lung images to the destination 
site. Another approach would be to send the selection pro- 
cedure to the site of the image database and to send only the 
"interesting" images back to the main program. If the selec-
tion procedure is able to filter out most of the images, this
approach would significantly reduce network requirements. 
A third, and even more flexible, approach would allow the 
shipped selection procedure to extract all the interesting 
images from the database but return only the size of the ex- 
tracted images to the main program. If the size is too big, 
the program may choose to move itself to the database site 
and perform the cancer-detection computation there rather 
than downloading all the data. This avoids downloading 
most images at the cost of (possibly) slower processing at 
the server. On the other hand if the size of the images is 
small, the data can be shipped over and processed locally. 
Figure 1 shows code for the third approach. This program 
makes its decision to migrate in a rudimentary fashion; a 
more realistic version of this application would also take 
network bandwidth and the processing power available on 
both machines into consideration. 

Sumatra assumes that a local resource monitor is avail-
able which can be queried for information about the envi-
ronment. In the next section, we describe one such monitor
which allows Sumatra applications to request informa-
tion about network latency between any pair of sites that
filter_object = new Lung_filter();
cancer_object = new Lung_checker(filter_object);
myengine = System.rpc.myEngine();

// Create an engine at the xray database site.
remote_engine = new Engine("xrays.gov");

// Send the lung-filter class to the remote engine
remote_engine.downloadClass("Lung_filter");

// Create a new object group.
objgroup = new ObjGroup("lung_filter_group");

// Add the lung_filter_object to the object group
objgroup.checkIn(filter_object);

// Move the object group to the database site
objgroup.moveTo(remote_engine);

// a remote method call selects interesting xrays
size = filter_object.query(db, "DarkLungs");

// Are there too many images to bring over?
if (size > too_many_images) {
    // Migrate thread, process images and return.
    remote_engine.go();
    result = cancer_object.detect_cancer();
    myengine.go();
}
else {
    // there are only a few interesting xrays. Fetch them
    // and process locally.
    objgroup.moveTo(myengine);
    result = cancer_object.detect_cancer();
}

// display result locally
System.display(result);


Figure 1: Excerpt of a Sumatra program that adaptively migrates to reduce its network bandwidth requirements 


run the monitor. 


3 Komodo: a distributed network latency 
monitor 


Komodo² is a distributed network latency monitor. The
design principles of Komodo are: low-cost active mon- 
itoring and fault-tolerance. Active monitoring uses sepa- 
rate messages for monitoring; passive monitoring generates 
no new messages and piggybacks monitoring information 
on existing messages. An active monitoring approach is 
needed for adaptalk (described in the next section) as 
passive monitoring cannot provide information about links
that are not used in the current placement but could be used
in alternative placements. It is our working hypothesis that 
effective mobility decisions can be based on medium-term 
(30 seconds to a few minutes) and long-term (hours) variations. At
these resolutions, we believe that active monitoring can 
be achieved at an acceptable cost. This section briefly de- 
scribes the design and implementation of Komodo. Further 
details about Komodo are presented in [18]. 

Komodo allows applications to initiate monitoring of 
network latency between any pair of hosts running the mon-
itor; the application need not be resident on either of the
hosts. Komodo is implemented as a user-level daemon that 
runs on every host participating in the computation. Ap- 
plications pass monitoring requests to their local Komodo 
daemon. If the requested link includes the current host, the 


² Komodo dragons are a species of monitor lizard found on the island
of Komodo, which is close to both Java and Sumatra.
local daemon handles the request. Otherwise, it forwards
the request to the daemon on the appropriate host. Dae- 
mons determine network latency by sending 32-byte UDP 
packets to each other. If an echo is not received within
the expected interval (the maximum of the ping period and
five times the current round-trip time estimate), the packet
is retransmitted. Using UDP for communication may oc-
casionally lead to loss of messages. Message loss can
lead only to a short-term loss of efficiency. As we expect 
monitoring requirements to be coarse-grained, the effect of 
packet loss should be small. Note that message loss is also 
a sign of network congestion and as such may be useful 
information for applications. 
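
The retransmission rule above reduces to a one-line computation. The following is a minimal sketch, assuming millisecond timestamps; the class and method names (`PingTimer`, `retransmitIntervalMillis`, `shouldRetransmit`) are hypothetical and are not taken from the Komodo implementation, which is written in C.

```java
// Sketch of the retransmission rule: wait for the larger of the ping period
// and five times the current round-trip time estimate before resending.
class PingTimer {
    static long retransmitIntervalMillis(long pingPeriodMillis, long rttEstimateMillis) {
        return Math.max(pingPeriodMillis, 5 * rttEstimateMillis);
    }

    // True once the echo for a ping sent at sentAtMillis is overdue.
    static boolean shouldRetransmit(long sentAtMillis, long nowMillis,
                                    long pingPeriodMillis, long rttEstimateMillis) {
        long waited = nowMillis - sentAtMillis;
        return waited >= retransmitIntervalMillis(pingPeriodMillis, rttEstimateMillis);
    }
}
```

With a one-second ping period, a link whose round-trip estimate is 80 ms is retried after 1000 ms, while a link with a 300 ms estimate is given 1500 ms, so slow links are not retried prematurely.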

Applications that initiate a monitoring request can spec- 
ify the frequency with which Komodo pings a link. Ko- 
modo enforces an upper bound on this frequency to keep 
the monitoring cost at an acceptable level. Applications 
need to refresh requests periodically to keep them alive; 
Komodo deactivates requests that have not been refreshed 
for longer than its request-timeout period. 
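
The request bookkeeping described above (a cap on ping frequency plus expiry of unrefreshed requests) might look roughly like the sketch below. All names (`MonitorRequests`, `register`, `refresh`, `expire`) and the use of string link identifiers are assumptions made for illustration; the text does not specify Komodo's data structures or the values of its limits.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Sketch of request bookkeeping: clamp the ping period and drop stale requests.
class MonitorRequests {
    private final long minPeriodMillis;      // upper bound on frequency = lower bound on period
    private final long requestTimeoutMillis; // unrefreshed requests expire after this long
    // link id -> { ping period, time of last refresh }
    private final Map<String, long[]> requests = new HashMap<>();

    MonitorRequests(long minPeriodMillis, long requestTimeoutMillis) {
        this.minPeriodMillis = minPeriodMillis;
        this.requestTimeoutMillis = requestTimeoutMillis;
    }

    // Register a request and return the ping period actually granted.
    long register(String link, long requestedPeriodMillis, long nowMillis) {
        long period = Math.max(requestedPeriodMillis, minPeriodMillis);
        requests.put(link, new long[] { period, nowMillis });
        return period;
    }

    // Applications call this periodically to keep a request alive.
    void refresh(String link, long nowMillis) {
        long[] entry = requests.get(link);
        if (entry != null) {
            entry[1] = nowMillis;
        }
    }

    // Deactivate requests not refreshed within the timeout; returns how many.
    int expire(long nowMillis) {
        int removed = 0;
        Iterator<long[]> it = requests.values().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next()[1] > requestTimeoutMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    boolean isActive(String link) {
        return requests.containsKey(link);
    }
}
```

Expiring unrefreshed requests means a crashed or departed application cannot leave the daemons pinging a link indefinitely.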

Latency measures acquired by Komodo are passed 
through a filter before being provided to applications. This 
filter eliminates singleton impulses as well as noise within a
jitter threshold (we use a jitter threshold of 10 ms, which is
the resolution of most Unix timers). If the measure changes 
rapidly, a moving window average is generated. This filter 
was designed on the basis of our study of a large number 
of Internet latency traces (see Section 5.1) which revealed 
that: (1) there is a lot of short-term jitter in the latency 
measures but in most cases, the jitter is small; (2) there
are occasional sharp jumps in latency that appear only for 
short time intervals; (3) occasionally, the latency measure 
baekdoo.cs.umd.edu and lanl.gov. Note that the four single-ping impulses towards the right end have been
eliminated. (b) The CPU utilization is computed by dividing the (user+system) time by the total running time. Each 
experiment was run for 1000 seconds with one ping per second for all links. 


fluctuates rapidly; (4) for time windows of 10 seconds or 
larger, the mode value (with a 10 ms jitter threshold) dom-
inates. To elaborate the last point, in most time windows,
70-90% of the latency values fall within a jitter threshold 
of the most common value. Our filter attempts to find the 
mode for a recent time window. If there is no stable mode 
(as happens occasionally), it returns the mean. Figure 2(a) 
illustrates the operation of the filter. 
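
A mode-or-mean filter of the kind described above can be sketched as follows. This is an illustration under stated assumptions, not Komodo's filter: the name `LatencyFilter`, the quadratic search for the mode, and the `stableFraction` parameter (the share of samples that must fall within the jitter threshold for the mode to count as stable) are all introduced here. The text reports only that 70-90% of values typically fall near the most common value.

```java
// Sketch: return the mode of a window (values within the jitter threshold
// count as equal) if it is stable, otherwise fall back to the mean.
class LatencyFilter {
    static double filter(long[] windowMillis, long jitterMillis, double stableFraction) {
        if (windowMillis.length == 0) {
            throw new IllegalArgumentException("empty window");
        }
        long best = windowMillis[0];
        int bestCount = 0;
        long sum = 0;
        for (long candidate : windowMillis) {
            sum += candidate;
            // Count the samples that lie within the jitter threshold of this candidate.
            int count = 0;
            for (long sample : windowMillis) {
                if (Math.abs(sample - candidate) <= jitterMillis) {
                    count++;
                }
            }
            if (count > bestCount) {
                bestCount = count;
                best = candidate;
            }
        }
        if (bestCount >= stableFraction * windowMillis.length) {
            return best; // stable mode: isolated spikes are ignored
        }
        return (double) sum / windowMillis.length; // no stable mode: use the mean
    }
}
```

For the window {100, 104, 98, 400, 102} with a 10 ms threshold, the single 400 ms impulse does not affect the result, whereas a window with no dominant cluster such as {10, 50, 90, 130} falls back to its mean.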

Each daemon maintains a cache of current latency esti- 
mates for all links it is currently monitoring. This cache is 
maintained in a well-known shared memory segment and 
can be efficiently read by all Sumatra applications execut- 
ing on the same machine. Cooperating Komodo daemons 
forward latency information in response to persistent re- 
mote requests. A latency estimate for a request received 
from another host is forwarded only when a new filtered 
estimate (different from the previous filtered estimate) is 
generated and is piggybacked onto a ping reply if possible. 
Currently, Komodo is implemented in C. 

To address concerns about the cost of active moni-
toring, we measured the CPU utilization of Komodo for
a varying number of links. Results in Figure 2(b) show
that the maximum CPU utilization for up to sixteen links 
is about 0.5%. The amount of data transferred is 512
bytes/second. This experiment was conducted on Sparc
5 machines (110 MHz, 32 MB of memory) running SunOS
Release 5.5.
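
As a rough consistency check on the 512 bytes/second figure, assume it counts one 32-byte probe per link per second in the sixteen-link case and ignores echo replies and packet headers (an interpretation, not something the text states):

```latex
16\ \text{links} \times 1\ \frac{\text{ping}}{\text{s}} \times 32\ \frac{\text{bytes}}{\text{ping}} = 512\ \frac{\text{bytes}}{\text{s}}
```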


1997 Annual Technical Conference 


4 Adaptalk: An adaptive Internet chat application


Adaptalk is a relatively simple network chat applica- 
tion built using Sumatra and Komodo. It allows multiple 
users to have an online conversation; new participants can 
join an ongoing conversation at any point; multiple in- 
dependent conversations can be held. To ensure that all 
participants see the same conversation and that new partic- 
ipants can join ongoing conversations, a central server is 
used to serialize and broadcast the contributions. 

Adaptalk is divided into three modules: handling 
keyboard events, managing the chat screen and coordinat- 
ing the communication between participants. Each com- 
ponent is implemented by a separate object-group. Each 
host participating in the conversation runs two execution- 
engines, one houses the screen object-group and the 
other houses the keyboard object-group. The central 
server is implemented as a separate shared object-group, 
the msgboard, which can be placed on any host partic- 
ipating in the conversation. Each message issued by a 
participant starts from a keyboard object which invokes a 
remote method on the msgboard. The msgboard serializes 
incoming messages and issues a series of remote-execution 
requests, one per participant, which update the screen ob- 
jects on all participants. In this case, remote execution is 
preferred to remote method invocation as there is no use- 







ful return value and remote execution allows fast one-way 
communication. 

Individual messages in adaptalk, and most other chat 
applications, consist of single lines of characters, usually no 
more than 50-60 characters. The goal of a chat application 
is to provide a short response-time to all participants so that 
a conversation can make quick progress. The response-time
for a particular participant depends on the latency between
it and the central server. Given the latencies of all the links,
the primary knob that adaptalk can turn to maintain a
low response-time for all participants is the position of the 
central server. 


4.1 Mobility policy 


There are two main features of the adaptalk mobil-
ity policy: (1) continuous tracking of the instantaneously
most-suitable-site and (2) deferral of server motion until the
potential for a significant and stable performance advan-
tage has been seen. The first feature allows it to quickly
take advantage of opportunities for optimization; the sec-
ond helps ensure the gain is greater than the cost. The goal
of adaptalk is to minimize the maximum response-time 
seen by any participant. The suitability of a participating 
machine as the location of the msgboard is characterized 
by the maximum network latency between it and all other 
participants. The machine that achieves the lowest measure 
is designated the most-suitable-site. 
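
The suitability measure above is a min-max selection over the latency matrix. A minimal sketch, assuming latencies are available as a square array indexed by participant (the class name `SitePicker` and the array representation are assumptions for illustration):

```java
// Sketch: pick the participant whose worst-case latency to any other
// participant is smallest. latencyMillis[i][j] is the latency between
// participants i and j; the diagonal is ignored.
class SitePicker {
    static int mostSuitableSite(long[][] latencyMillis) {
        int best = -1; // returned when there are no participants
        long bestWorst = Long.MAX_VALUE;
        for (int i = 0; i < latencyMillis.length; i++) {
            long worst = 0;
            for (int j = 0; j < latencyMillis.length; j++) {
                if (j != i) {
                    worst = Math.max(worst, latencyMillis[i][j]);
                }
            }
            if (worst < bestWorst) {
                bestWorst = worst;
                best = i;
            }
        }
        return best;
    }
}
```

With three participants whose pairwise latencies are 50 ms (0-1), 200 ms (0-2) and 60 ms (1-2), participant 1 is chosen because its worst link is 60 ms, compared with 200 ms for the other two.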

Adaptalk’s migration policy is shown in pseudo-code 
in Figure 3. This algorithm is run at the location that hosts 
the msgboard and recomputes the most-suitable-site each 
time a new message is posted by any participant. The 
msgboard maintains an array of counters, one for each po- 
tential location, which keep track of the number of times 
each location is found to be the most-suitable-site. The 
msgboard moves whenever: (1) the current site receives a 
very low score (< loss_threshold) over a given period (the 
decision_cycle); or (2) a different site receives more than 
a threshold score (the win_threshold). The first condition 
is used to quickly move away from locations that provide 
poor performance; the second condition is used to move the
msgboard to locations that consistently promise better per- 
formance. The counters are reset whenever the msgboard 
moves, the decision_cycle completes or a participant enters 
or leaves the conversation. 

We expect three types of variations in the network char- 
acteristics which may be cause for migration: (1) popula- 
tion variations, which represent changes in the distribution 
of users on the network, as participants join or leave an 
ongoing conversation; (2) spatial variations, i.e. stable 
differences between latencies of different links; and (3) 
temporal variations, i.e. changes in the latency of a link 
over a period of time. Adaptalk’s migration policy can
adapt to all three types of variations. Consider the case
with a fixed number of participants with significant spatial 
variation in network latency and little temporal variation. 
In this case, the migration algorithm rapidly recognizes the 
best location for the msgboard, but waits until this choice 
has been ratified over some period of time (count[newloc] 
> win_threshold) before moving it. As shown in Section 5, 
this policy allows adaptalk to effectively insure itself 
against poor initial placement. Once a good location has 
been found, the msgboard does not move, unless tempo- 
ral variations or changes in population distribution cause 
another node to become a substantially better location (i.e. 
count[newloc] > win_threshold) or the current host to become
a substantially bad choice (i.e. count[curr_engine]
< loss_threshold && rounds % decision_cycle == 0). In 
such cases, the msgboard will move during the conversa- 
tion. After initial experiments with adaptalk, we set 
the win_threshold to be 25 × n, the loss_threshold to be
12 × n and the decision_cycle to be 50 × n. Here, n is the
number of participants. The length of the decision_cycle 
was set large enough to amortize the cost of movement 
in cases where large temporal variations or fluctuations in 
population distribution cause frequent repositioning. 


    Get the all-to-all latency map from Komodo;
    Find the site s that would minimize the max
        latency for messages posted to msgboard;
    count[s] = count[s] + 1; rounds++;
    let w be the site with the largest count;
    let curr_engine be the engine which
        currently houses msgboard;
    // Found a clear cut winner.
    if (count[w] > win_threshold) return w;
    else if (rounds % decision_cycle == 0) {
        // Is the current engine an ok location ?
        if (count[curr_engine] > loss_threshold) {
            clear count for each host;
            return curr_engine;
        } else {
            // Current engine is a bad location.
            set new_host to the host with the
                maximum count;
            clear count for each host;
            return new_host;
        }
    } else return null; // cycle not yet over.


Figure 3: Decision Algorithm for msgboard placement used in Adaptalk. This algorithm is run at the location
where the msgboard resides each time a message is posted.
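
For readers who prefer executable code, the decision algorithm of Figure 3 might be rendered in Python roughly as follows. This is a sketch, not the Sumatra implementation: the class and method names are hypothetical, the most-suitable-site is assumed to be computed elsewhere and passed in, and the `moved_to` helper reflects the statement in the text that counters are reset whenever the msgboard moves.

```python
from typing import Dict, Iterable, Optional


class PlacementPolicy:
    """Counter-based placement decision, following the pseudo-code of Figure 3."""

    def __init__(self, sites: Iterable[str], current: str, n_participants: int):
        self.count: Dict[str, int] = {site: 0 for site in sites}
        self.current = current
        self.rounds = 0
        # Threshold values reported in the text, scaled by the number of participants.
        self.win_threshold = 25 * n_participants
        self.loss_threshold = 12 * n_participants
        self.decision_cycle = 50 * n_participants

    def clear_counts(self) -> None:
        for site in self.count:
            self.count[site] = 0

    def on_message(self, best_site: str) -> Optional[str]:
        """Record this round's most-suitable-site; return a host, or None if undecided."""
        self.count[best_site] += 1
        self.rounds += 1
        leader = max(self.count, key=self.count.get)
        if self.count[leader] > self.win_threshold:
            return leader  # found a clear-cut winner
        if self.rounds % self.decision_cycle == 0:
            # Keep the current engine if it scored above the loss threshold.
            keep = self.count[self.current] > self.loss_threshold
            choice = self.current if keep else leader
            self.clear_counts()
            return choice
        return None  # decision cycle not yet over

    def moved_to(self, site: str) -> None:
        """Assumed helper: the caller reports a completed move, which resets counters."""
        self.current = site
        self.clear_counts()
```

A caller would invoke `on_message` each time a message is posted and migrate the msgboard only when the returned host differs from the current one.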


5 Evaluation


To evaluate the performance impact of network-aware 
adaptation on the Internet, we performed two sets of exper- 
iments. First, we monitored round-trip times for 32-byte 
ICMP packets between a large set of host-pairs over several 
days. The goal of these experiments was to study the spatial 
and temporal variation in network latency on the Internet. 
Results from this study are presented in section 5.1. 

Second, we measured the performance of three versions 
of adaptalk over long-haul networks, using traces collected
during the Internet study. Our evaluation had two
main goals: (1) to determine if network-aware placement
of components of an application distributed over multiple 
hosts on the Internet provides significant performance gains 
over a network-oblivious placement; and (2) to determine 
if dynamic placement based on online network monitor- 
ing provides significant performance gains over a one-time 
placement based on initial information. Results from this
study are presented in section 5.3. 


5.1 Variations in Internet latency 


We selected 45 hosts: 15 popular .com web sites (US),
15 popular .edu web sites (US) and 15 well-known non-US
hosts. These hosts were pinged from four different
locations in the US. The study was conducted over several 
weekdays, each host-pair being monitored for at least 48 
hours. We used the commonly available ping program and 
sent one ping per second. This resolution was acceptable 
as our goal was to discover medium-term (30 sec to minutes)
and long-term (hours) variations. 

The conclusions of our study, briefly, are: (1) there is 
large spatial variation in Internet latency (the per-hour mean 
latency varied between 15 ms and 863 ms for US hosts and 
between 84 ms and 4000 ms for non-US hosts); (2) there is a 
large and stable variation in the latency of a single host-pair 
over the period of a day (maximum daily variation in per- 
hour mean latency for US hosts was 550 ms and for non-US 
hosts was 5750 ms); (3) there is a lot of jitter in the latency
measures but in most cases, the jitter is small; (4) there 
are isolated peaks in latency that appear only for a single 
time interval; (5) for time windows of 10 seconds or larger, 
the mode value (with a 10 ms jitter threshold) dominates 
(in most time windows, 70-90% of the latency values fall 
within a jitter threshold of the most common value); (6) the 
moving-window mode changes quite slowly. 
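
One possible reading of the windowed mode statistic in conclusion (5) is sketched below in Python. The rounding of samples to whole milliseconds, the function name, and the sample values are assumptions made for illustration; the paper does not spell out how the mode was computed.

```python
from collections import Counter
from typing import Sequence, Tuple


def window_mode(samples_ms: Sequence[float], jitter_ms: float = 10.0) -> Tuple[int, float]:
    """Return (mode, fraction of samples within jitter_ms of the mode) for one window.

    Samples are rounded to whole milliseconds before counting (an assumption).
    """
    rounded = [round(sample) for sample in samples_ms]
    mode, _ = Counter(rounded).most_common(1)[0]
    near = sum(1 for sample in samples_ms if abs(sample - mode) <= jitter_ms)
    return mode, near / len(samples_ms)


# Hypothetical 10-second window sampled once per second, with two isolated peaks.
window = [100, 102, 100, 98, 100, 350, 101, 100, 105, 400]
mode, share = window_mode(window)
```

Here eight of the ten samples lie within 10 ms of the most common value, which is the kind of dominance the study reports for most windows.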


5.2 Experimental Setup 


Having established that there are significant spatial and
temporal variations in network latency on the Internet, we
examined how well adaptalk could adapt to these variations.

To simulate the characteristics of long-haul networks, 
we decided to run our experiments over a low-latency 
LAN and delay all packets based on the ICMP ping 
traces described above (see Figure 4 (a)). This approach
also allowed us to perform repeatable experiments.
To ensure that delaying packets instead of using
a real network does not skew the latency measures,
we performed a simple test. Free-running Komodo mon- 
itors were installed at bookworm.cs.umd.edu and 
jarlsberg.cs.wisc.edu and were used to collect 
UDP latency measures between this host-pair. In paral- 
lel, a trace of ICMP ping times between these two hosts 
over the same period (5000 sec) was collected. This trace 
was later fed into trace-driven Komodo monitors running 
on two hosts on our LAN. The latency measures reported 
by the trace-driven monitors matched quite well with the 
actual latency measures reported by free-running monitors. 
The average of the actual latency measures was 128 ms 
(std dev = 64); the average of the values reported by the 
trace-driven monitors was 144 ms (std dev = 68). 

We performed all our experiments on four Solaris
machines on our LAN. We picked six trace-segments 
from the Internet study and used them to delay pack- 
ets between the machines. All these segments were 
over the noon-2pm EDT period. We selected this period
since noon is the approximate beginning of the
daily latency peak for US networks as well as the ap- 
proximate end of the daily latency peak for many non- 
US networks. These traces were selected to approxi- 
mate the network latency spectrum observed in the In- 
ternet study. Hosts participating in the selected traces 
include: java.sun.com, home.netscape.com, 
www.opentext.com, cesdis.gsfc.nasa.gov, 
www.monash.edu.au and www.ac.il. This setup 
makes the four local machines behave like four far-flung 
machines on the Internet. Figure 4 (b) shows the configu- 
ration used for the experiments. 
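
The trace-driven delay mechanism can be pictured with the following Python sketch. It is not the code used in the experiments: the class name is hypothetical, and two details are assumptions made here for illustration, namely that the one-way delay is taken as half the recorded round-trip time and that the trace wraps around when it is exhausted. The one-sample-per-second spacing matches the ping rate described in section 5.1.

```python
from typing import Sequence


class TraceDelay:
    """Replay a recorded ping trace as per-packet delays on a LAN."""

    def __init__(self, rtt_trace_ms: Sequence[float], sample_period_s: float = 1.0):
        if not rtt_trace_ms:
            raise ValueError("trace must contain at least one sample")
        self.rtt_trace_ms = list(rtt_trace_ms)
        self.sample_period_s = sample_period_s

    def delay_ms(self, elapsed_s: float) -> float:
        """Delay to impose on a packet sent elapsed_s seconds into the experiment."""
        index = int(elapsed_s // self.sample_period_s) % len(self.rtt_trace_ms)
        # Assumption: one-way delay is half of the measured round-trip time.
        return self.rtt_trace_ms[index] / 2.0


# Hypothetical three-sample trace (round-trip times in milliseconds).
link = TraceDelay([100.0, 200.0, 50.0])
```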


5.3 Experiments 


We performed a series of experiments to evaluate the
benefits of adapting to various types of network varia- 
tions. The experiments consisted of running three dif- 
ferent versions of the chat server. The first version, 
called static-placement, had no migration support and no 
network-awareness. The location of the msgboard was 
chosen in a network-oblivious fashion. The second version 
was a stripped-down version of adaptalk, called one-shot-placement.
It used network information from Komodo to
find the best initial placement for the msgboard, and used
mobility support to move it there. After initial placement,
migration decisions and network-awareness were turned
off. The third version, called dynamic-placement, was the
full-fledged adaptalk, as described in section 4. It used
on-line monitoring and dynamic placement to position the
msgboard.


Figure 4: Experimental Setup. Four local machines on a LAN were used to simulate four remote machines on the Internet
by adding delays to packets. ICMP ping traces between real Internet hosts were used to generate the delays, so as to capture
real-life temporal variations in latencies.


The performance of static-placement depends on the lo- 
cation of the msgboard. If static-placement chooses the 
same location as one-shot-placement, both would have 
the same performance. On the other hand, since static- 
placement is network-oblivious, it is just as likely to place 
the msgboard at the worst possible location. As the per- 
formance of one-shot-placement already provides a rough 
upper-bound on the performance of static-placement, we 
deliberately chose the worst initial placement when run- 
ning the static-placement version. 


Adapting to Population Variation: To evaluate the ef- 
fect of changing user distribution we used the following 
workload: a conversation was initiated between hosts C
and D. Host B joins the conversation after 15 minutes, 
and host A joins 15 minutes thereafter. Each host sends a 
sequence of 70-character sentences with a 5-second think 
time between sentences. With only two hosts initiating 
the conversation, there is no difference between the best 
and worst initial placements for the msgboard and both 
static-placement and one-shot-placement perform identically
(both place the msgboard on host D). Figure 5 (a)
plots the maximum latency over all hosts for the one-shot- 
placement version. Note that even after new hosts join 
the conversation there is no noticeable difference in maximum
latency. In contrast, dynamic-placement
adapts to the changing population. Soon after host B joins 
the conversation, the adaptive placement policy moves the 
msgboard there, causing a drop in the maximum latency.
After host A joins the conversation, the msgboard moves
between hosts A and B in response to temporal fluctua- 
tions. This can be seen from the variation in latency for 
host B in Figure 5 (b). These movements help keep the 
maximum latency steady even in the presence of temporal 
fluctuations. 


Adapting to Temporal and Spatial Variation: In this case 
the client population is assumed to be stable. The work- 
load consists of all 4 hosts jointly initiating a conversation 
which runs for 75 minutes. As before, each host generates 
a new sentence every 5 seconds. In this case, the network- 
oblivious (static-placement) version places the chat server 
on host D. The network-aware (one-shot-placement) ver- 
sion uses latency information provided by Komodo to de- 
termine that host B is a much better placement. For the 
dynamic-placement version, initial placement is less im- 
portant as it should be able to recover from a bad initial 
placement. For this version, we place the msgboard at host 
D, the worst-possible location. 


To avoid clutter, Figure 6 shows the performance of 
these three versions in two different graphs. Figure 6 (a) 
compares the maximum latency (over all participants) for 
the dynamic-placement and static-placement versions. As 
seen from the sharp drop on the left end of the graph, the 
dynamic-placement version is successfully able to move 
the msgboard away from its bad initial placement to a 
more suitable location. Figure 6 (b) compares the average 
maximum latency (over all participants) for the dynamic- 
placement and one-shot-placement versions. It shows that 
once the dynamic-placement version moves the server to 
a more suitable location, the performance of the two ver- 
sions is largely equivalent. This implies that adapting to 
short-term temporal variations in a steady population workload
does not provide much performance advantage over
one-shot network-aware placement. It may, however, still
be advantageous to adapt to long-term temporal variations.
Note that at the far right of Figure 6 (b), temporal variations
in the link latencies do allow the dynamic-placement
version to do better than the one-shot-placement version.


Figure 5: Adapting to population variation. Hosts C and D initiate the conversation. Host B joins after 900 seconds and
host A joins 1800 seconds after the beginning. The one-shot-placement version places the chat server at host D. The
dynamic-placement version migrates the server when new hosts join.


6 Discussion 


In the introduction, we raised three questions with re- 
spect to network-aware mobility. First, how should pro- 
grams be structured to utilize mobility to adapt to varia- 
tions in network characteristics? Second, is the variation in 
network characteristics such that adapting to them proves 
profitable? Finally, can adequate network information be 
provided to mobile applications at an acceptable cost? 

Our experience with Sumatra and adaptalk provides 
some early insights about application structure suitable for 
adaptive mobile programs. First, the migration policy 
should be cheap so that applications don’t have to analyze 
the tradeoffs of the migration decision itself. An easy-to- 
compute policy allows frequent decisions and rapid adap- 
tation to changes in the environment. We believe that an 
easy-to-compute migration policy was key to adaptalk’s
ability to quickly find good locations for the chat server.
Second, good modularization helps an application take advantage
of mobility. Modularization is important for all distributed
applications but it is more so for mobile programs
as they have to make online decisions about the placements 
of different components. Third, to be resource-aware, re- 
mote accesses should be split-phase; the first phase delivers 
an abbreviation, a small and cheaply computed metric of 
the data (for example, size, number of data items, thumbnail
sketch, etc.) and the second phase actually accesses the
data. This allows the application to change its data access 
modality for the second phase (retrieve remotely, request 
filtering, move to data location) based on the value of the 
abbreviation and knowledge of its own requirements. This 
insight comes from our experience with writing other ap- 
plications in Sumatra; adaptalk does not benefit from 
this as the size of all messages is small. 
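
To make the split-phase idea concrete, the Python sketch below picks an access modality for the second phase from an abbreviation returned by the first phase. Everything specific here is hypothetical: the paper names the three modalities but gives no decision rule, so the size comparison, the function name, and the parameters are assumptions chosen only to illustrate the structure.

```python
from enum import Enum


class AccessMode(Enum):
    RETRIEVE = "retrieve remotely"
    FILTER = "request filtering"
    MOVE = "move to data location"


def choose_access_mode(data_bytes: int, needed_fraction: float,
                       program_bytes: int) -> AccessMode:
    """Pick the second-phase modality from the first-phase abbreviation.

    data_bytes is the abbreviation (here, the size of the remote data),
    needed_fraction is the share of the data the application expects to use,
    and program_bytes is the cost of shipping the computation instead.
    The comparison rule is an illustrative assumption, not Sumatra's policy.
    """
    if data_bytes <= program_bytes:
        return AccessMode.RETRIEVE  # cheaper to fetch everything
    if data_bytes * needed_fraction <= program_bytes:
        return AccessMode.FILTER    # fetch only the useful part
    return AccessMode.MOVE          # cheaper to move the computation
```

Under this rule, small data is simply fetched, large data of which little is needed is filtered at the source, and large data that is mostly needed draws the computation to it.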

An important question that needs further investigation is
where the control for mobility decisions should be placed 
— whether mobility decisions should be made by a central 
controller that keeps track of the state of all links or by 
multiple local controllers that use information only from a 
small subset of the links. Centralized decisions are likely 
to be more expensive than distributed decisions (the latter 
need less information and less synchronization) but could 
yield better performance (as they use global information). 

Figure 6: Maximum latency (over all participants) vs time in adaptalk. The one-shot-placement and static-placement
locations are computed based on latency information available when the conversation is initiated. The client population is stable
throughout the experiment.

To answer the second question, we evaluated the profitability
of adapting to changes in the user-distribution as
well as spatial and temporal variations in network latency. 
Adapting to changes in user-distribution led to significant 
gains allowing adaptalk to find better placements as 
more users came online. Support for mobility allows applications
built around a central data-structure to recover from
a poor initial placement of this structure by repositioning it 
to a more suitable location. Adapting to temporal variations 
alone did not lead to significant benefits over the period
of an hour. In light of this experience, we expect that a 
simpler migration policy for adaptalk for short periods 
would consider migration only when users join or leave the 
conversation, rather than on every message as is currently 
done. Since long-term variation of latency could be as large 
as 550 ms (US hosts) and 5750 ms (non-US hosts), longer
conversations could still benefit from adapting to temporal 
variations. 


Our experiments with Komodo illustrate that cheap ac- 
tive monitoring can provide network information that can 
be profitably exploited. Though it would be best to use 
Komodo as a stand-alone system supplying network infor- 
mation to many distributed applications, its cost is so low 
that one can contemplate rolling Komodo into individual 
applications such as adaptalk without overloading the 
network. Active monitoring was needed for adaptalk 
as it needed information about links that are not used in the
current placement but could be used in alternative placements.
Other applications that change the location of computation
but do not change the pattern of communication
would not need active monitoring as they could piggyback 
monitoring information on existing messages. An example 
of such an application would be an information access pro- 
gram on a mobile platform which moves primarily between 
this platform and a proxy host on the static network. Ac- 
tive monitoring, as implemented in Komodo, will not be as 
cheap for applications that are bandwidth-sensitive and not 
latency-sensitive. We are currently investigating methods 
to cheaply estimate Internet bandwidth. 

In this paper, we have considered Internet hosts that 
are static. If the platform is mobile and is able to switch 
between multiple wireless networks [14], the temporal vari- 
ation in latency could be greater and more abrupt. In these 
cases, adapting to short-term temporal variations could pro- 
vide a significant benefit even for latency-sensitive appli- 
cations. 

System stability is a potential concern for programs 
whose components are mobile. We believe that system 
stability is a property of the application and not the un- 
derlying system support. Accordingly, Sumatra does not 
provide automatic tracking. Instead, it provides support (in 
the form of object-moved exceptions) that allows applica- 
tions to track mobile objects (see section 2 for details). We 
have not yet encountered stability problems in any of our 
applications. 

Finally, we would like to argue the need for mobility 
as an adaptation mechanism. An alternative adaptation
mechanism, which places replicated servers at all suitable
points in the network, could adapt to spatial, temporal and 
population variation by handing off control between servers 
and by using dynamically created hierarchies of servers. 
It is quite likely that for any particular application, such a 
strategy would be able to achieve the performance achieved 
by programs that use program mobility as the adaptation 
tool. The advantage of mobility-based strategies is that
they allow small groups of users to rapidly set up private
communities on-demand without requiring extensive server 
placement. 


7 Related work 


Process migration and remote execution have been pro- 
posed, and have been successfully used, as mechanisms for 
adapting to changes in host availability [5, 7, 15, 21, 24]. 
Remote execution has also been proposed for efficient ex- 
ecution of computation that requires multiple remote ac- 
cesses [6, 8, 22] and for efficient execution of graphical user 
interfaces which need to interact closely with the client [2]. 
Both these application scenarios use remote execution as 
a way to avoid using the network. Most proposed uses 
of Java [10] also use remote execution to avoid repeated 
client-server interaction. In these applications, decisions 
about the placement of computation are hardcoded. To the 
best of our knowledge, Sumatra (together with Komodo) is 
the first system that allows distributed applications to mon- 
itor the network state and dynamically place computation 
and data in response to changes in the network state. We 
also believe that our experiment with adaptalk is the 
first attempt to determine if the variation in Internet charac- 
teristics is such that it is profitable for applications to adapt 
to them. 

Network-awareness is particularly important to appli- 
cations running on mobile platforms which can see rapid 
changes in network quality. Various forms of network- 
awareness have been proposed for such applications. 
Application-transparent or system-level adaptation to vari- 
ations in network bandwidth has been successfully used 
by the designers of the Coda file system [17] to improve 
the performance of applications. The Odyssey project on 
mobile information access plans to provide support for 
application-specific resource monitoring and adaptation. 
The primary adaptation mechanism under consideration is 
change in data fidelity [20]. Athan and Duchamp [1] pro- 
pose the use of remote execution for reducing the commu- 
nication between a mobile machine and the static network. 
In all these systems, location of the various computation 
modules is fixed; adaptation is achieved by changing the 
way in which the network is used.

Several systems have been built which permit an executing
program to move while it is in execution - for example
Obliq [3], Agent TCL [11], Emerald [13], Telescript [23]
and TACOMA [12]. The primary distinction between these
systems and Sumatra is that in Sumatra, all communication
and migration happen under application control. Complete
application control allows us to easily explore different
policy alternatives for resource monitoring and for
ferent policy alternatives for resource monitoring and for 
adapting to variations in resources. 

Several studies have been performed to determine end-to-end
Internet performance. Sanghi et al. [19] and Mukherjee
[16] have studied network latency. Their observations
show that while round trip times show significant variabil- 
ity with sharp peaks, there exist dominant low frequency 
components. This is consistent with our observations that 
in a time window of reasonable size, the mode value usually
dominates and that the mode value changes slowly. 

Golding [9] and Carter and Crovella [4] have studied 
mechanisms to estimate end-to-end Internet bandwidth. 
Golding’s results indicate that attempts to predict bandwidth
using previous observations alone are unlikely to work
well. Carter and Crovella propose the use of round trip 
times for short packets to estimate network congestion. 
They propose to use the network congestion information to 
estimate changes in network bandwidth (assuming the in- 
herent bandwidth of the link has been previously computed 
by flooding the link). Their results indicate that it might 
be possible to estimate the change in network bandwidth 
using information about the change in network latency. 


8 Conclusions and future work 


This paper is a first step in demonstrating that distributed 
programs can use mobility as a tool to adapt to variations in 
their operating environment. Our exploration of network-aware
mobile programs leads us to the following conclusions.
First, network-aware placement of components of a
distributed application can provide significant performance
gains over a network-oblivious placement. For short-term
applications (applications that run for an hour or so), ex- 
ploiting spatial variations as well as variations in the num- 
ber and location of the clients achieves most of the gains. 
For longer-running applications, exploiting temporal vari- 
ations might be worthwhile. Second, effective mobility 
decisions can be based on coarse-grained monitoring. This 
allows cheap active monitoring without losing effective- 
ness. Finally, there is significant spatial and temporal vari- 
ation in Internet latency which can be effectively adapted 
to by mobile programs. 

We believe that there is a class of long-running applications
over the Internet for which resource-aware mobility
could provide flexibility and performance which would
take a lot more effort to achieve by other means. One future
direction we would like to pursue is to identify such appli- 
cations and understand their structure and requirements. 
Some of the examples we intend to study include network-bandwidth-aware
data-combination on the Internet, custom
combination of periodically generated large-volume
datasets (such as weather information) and resource-aware 
pre-fetching for web clients. 

Another direction that we plan to explore is efficient 
distributed monitoring of other resources, in particular net- 
work bandwidth and server availability. We are investigat- 
ing cheap methods of estimating network bandwidth. An 
important question that we are investigating is how accurate
resource estimates need to be in order to benefit from resource-aware
mobility and how the accuracy of estimation
affects performance. 

Network-awareness is very important to applications 
running on mobile platforms which can see rapid changes 
in network quality. The nature of these variations is sig- 
nificantly different from those that we have seen in our 
Internet study. We are planning to extend our work to mo- 
bile platforms. In particular, we would like to understand 
the monitoring requirements and the mobility algorithms in 
a mobile computing environment. 

The focus of our work has been to make distributed 
applications achieve better performance using mobility as 
a tool to adapt to resource variations. We have therefore 
not as yet addressed the important issues of security and 
resource-use containment in our implementation. We plan 
to look into both these issues.


Abstract

Individual machines are no longer sufficient to handle the 
offered load to many Internet sites. To use multiple ma- 
chines for scalable performance, load balancing, fault trans- 
parency, and backward compatibility with URL naming 
must be addressed. A number of approaches have been de- 
veloped to provide transparent access to multi-server Inter- 
net services including HTTP redirect, DNS aliasing, Magic 
Routers, and Active Networks. Recently however, portable 
Java code and lightly loaded client machines allow the mi- 
gration of certain service functionality onto the client. In 
this paper, we argue that in many instances, a client-side ap- 
proach to providing transparent access to Internet services 
provides increased flexibility and performance over the existing
solutions. We describe the design and implementation
of Smart Clients and show how our system can be used to 
provide transparent access to scalable and/or highly avail- 
able network services, including prototypes for: telnet, 
FTP, and an Internet chat application. 


1 Introduction 


The explosive growth of the World Wide Web is 
straining the architecture of many Internet sites. Slow
response times, network congestion, and “hot sites
of the day” being overrun by millions of requests
are fairly commonplace. These problems will only 
worsen as the Web continues to experience rapid 
growth. As a result, it has become increasingly important
to design and implement network services, such
as HTTP, FTP, and web searching services, to scale 
gracefully with offered load. Such scalable services 
must, at minimum, address the following issues: 


• Incremental Scalability: If the offered load begins
to exceed a service’s hardware capacity, it
should be a simple operation to add hardware
resources to transparently increase system capacity.
Further, a service should be able to
recruit resources to handle peaks in the load.
For example, while the US Geological Survey
Web site (http://quake.usgs.gov)
is normally quite responsive, it was left completely
inaccessible immediately after a recent
San Francisco Bay Area earthquake.


• Load Balancing: Load should be spread dynamically
among server resources so that clients receive
the best available quality of service.


• Fault Transparency: When possible, the service
should remain available in the face of server and
network upgrades or failures.


• Wide Area Service Topology: Individual
servers comprising a service are increasingly
distributed across the wide area [Net 1994, Dig
1995]. The server machines that comprise a
network service should not be required to have a
restricted or static topology. In other words, all
servers should be allowed to arbitrarily migrate
to other machines.


• Scalable Service To Legacy Servers: Adding
scalability to existing network services such as
FTP, Telnet, or HTTP should not require modifications
to existing server code.

Unfortunately, providing these properties for network
services while remaining compatible with the
de facto URL (Uniform Resource Locator) naming 
scheme has proven difficult. URLs are by definition 
location dependent and hence are a single point of failure
and congestion. A number of efforts address this
limitation by hiding the physical location of a par- 
ticular service behind a logical DNS hostname. Ex- 
amples of such systems include HTTP redirect, DNS 
Aliasing, Failsafe TCP, Active Networks, and Magic 
Routers. 

We argue that in many cases the client, rather than 
the server, is the right place to implement transparent 
access to network services. We will describe limita- 
tions associated with each of the above solutions and 
demonstrate how these limitations can be avoided by 
moving portions of server functionality onto the client 
machine. This approach offers the advantage of in- 
creased flexibility. For example, clients aware of the 
relative load on a number of FTP mirror sites can con- 
nect to the least loaded mirror to deliver the highest 
throughput to the end user. Ideally, the selection and 
connection process takes place without any interven- 
tion from the end user, unlike the Web today where 
users must choose among FTP mirror sites manually. 
Note that in this example, clients must take into ac- 
count available network bandwidth to each mirror site 
as well as the relative load of the sites to receive op- 
timal performance. Such flexibility would be difficult 
to provide with existing server-side solutions since in- 
dividual servers may not have knowledge of mirror 
site group membership and client location. 
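
The FTP mirror example can be sketched in Python as follows. This is an illustration only, not the Smart Clients implementation: the `Mirror` record, the host names, and in particular the scoring rule (available bandwidth scaled by the idle fraction of the server) are hypothetical stand-ins for whatever combination of bandwidth and load a real client would use.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Mirror:
    host: str
    bandwidth_kbps: float  # client's estimate of available bandwidth to this mirror
    load: float            # reported server load as a fraction between 0.0 and 1.0


def pick_mirror(mirrors: Sequence[Mirror]) -> Mirror:
    """Choose the mirror with the highest estimated throughput.

    The estimate (bandwidth times idle fraction) is an assumed heuristic.
    Raises ValueError if no mirrors are given.
    """
    return max(mirrors, key=lambda m: m.bandwidth_kbps * (1.0 - m.load))


# Hypothetical mirror group as seen from one client.
candidates = [
    Mirror("ftp1.example.org", 1000.0, 0.9),
    Mirror("ftp2.example.org", 400.0, 0.2),
    Mirror("ftp3.example.org", 600.0, 0.5),
]
chosen = pick_mirror(candidates)
```

In this made-up group the mirror with the fastest link is also the most heavily loaded, so the client would connect to the second mirror instead, without any intervention from the user.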

The migration of service functionality onto client 
machines is enabled by two recent developments. Today,
most popular Internet services, such as FTP,
HTTP, and search engines are universally accessed 
through extensible Web browsers. This extensibility 
allows insertion of service-specific code onto client 
machines. In addition, the advent of Java [Gosling 
& McGilton 1995] allows such code to be easily distributed
to multiple platforms. Next, network latency
and bandwidth are increasingly the bottleneck to the 
performance delivered to clients. Thus, client proces- 
sors can be left relatively idle. We will demonstrate 
that offloading service functionality onto these idle 
cycles can substantially improve the quality of service 
along the axis described above. 

Motivated by the above observations, we describe 
the design and implementation of Smart Clients to 
support our argument for client-side location of code 
for scalability and transparency. The central idea be- 
hind Smart Clients is migrating server functionality 
to the client machine to improve the overall quality
of service in the ways described above. This approach
contrasts with the traditional “thin-client” model
where clients are responsible largely for displaying 
the results of server operations. While our approach 
is general, this paper concentrates on augmenting the 
client-side architecture to provide benefits such as 
fault transparency and load balancing to the end user. 

The rest of this paper is organized as follows. Section 2
discusses existing solutions to providing scalable
services. The limitations of the existing solutions
motivate the Smart Client architecture, described in
Section 3. Section 4 demonstrates the utility of the ar- 
chitecture by describing the implementation and per- 
formance of interfaces for telnet, FTP, and a scal- 
able chat service. Section 5 evaluates our require- 
ments above in the context of the Smart Client archi- 
tecture. Section 6 describes related work, and Sec- 
tion 7 concludes. 


2 Alternative Solutions 


Existing architectures include DNS Aliasing [Brisco 
1995, Katz et al. 1994], HTTP redirect [Berners-Lee 
1995], Magic Routers [Anderson et al. 1996], fail- 
safe TCP [Goldstein & Dale 1995], and Active Net- 
works [Wetherall & Tennenhouse 1995]. Figure 1 de- 
scribes how Smart Clients fits in the space of existing 
solutions. We will describe each of the existing so- 
lutions in turn leading to a description of the Smart 
Client architecture. 

A number of Web servers use Domain Name Server
(DNS) aliasing to distribute load across a number of
machines cooperating to provide a service. A single
logical hostname for the service is mapped onto multiple
IP addresses, representing each of the physical
machines comprising the service. When a client resolves
a hostname, alternative IP addresses are provided
in a round-robin fashion. DNS aliasing has been
demonstrated to show relatively good load balancing;
however, the approach also has a number of disadvantages.
First, random load balancing will not work as
well for requests demonstrating wide variance in processing
time. Second, DNS aliasing cannot account
for geographic load balancing since DNS does not
possess knowledge of client location or server capabilities.

On a client request, HTTP redirect allows a server
to instruct the client to send the request to another location
instead of returning the requested data. Thus,
a server machine can perform load balancing among
a number of slave machines. However, this approach
has a number of limitations: latency to the client is
doubled for small requests, a single point of failure
is still present (if the machine serving redirects is unavailable,
the entire service appears unavailable), and
servers can still be overloaded attempting to serve
redirects. Further, this mechanism is currently only
available for HTTP; it does not work with legacy services
nor does it optimize wide-area access.

Figure 1: This figure describes the design space for providing transparent access to scalable network services.
Transparency mechanisms can be implemented in a number of places, including the client, network, network
routers, or at the service site.


The Magic Router provides transparent access by 
placing a modified router on a separate subnet from 
machines implementing a service. The Magic Router 
inspects and possibly modifies all IP packets before 
routing the packets to their destination. Thus, it 
can perform load balancing and fault transparency 
by mapping a logical IP address to multiple server 
machines. If a packet is destined for the designated 
service IP address, the Magic Router can dynami- 
cally modify the packet to be sent to an alternative 
host. Unresolved questions with Magic Routers in- 
clude how much load can be handled by the router 
machine before the dynamic redirection of the pack- 
ets becomes the bottleneck (since it must process ev- 
ery packet destined for a particular subnet). Magic 
Routers also require a special network topology which 
may not be feasible in all situations. Finally, the 
Magic Router is not aware of the load metrics relevant 
to individual services, i.e. it would have to perform 
remappings based on a generic notion of load such as 
CPU utilization. 


Fail-safe TCP replicates TCP state across two in- 
dependent machines. On a server failure, the peer 
machine can transparently take over for the failed 
machine. In this fashion, fail-safe TCP provides 
fault transparency. However, it requires a dedicated 
backup machine to mirror the primary server, and it 
does not address the problem of the front-end becoming
a bottleneck. Finally, both fail-safe TCP and
Magic Routers are relatively heavy-weight solutions
requiring extra hardware.

Proposals for Active Networks allow for computa- 
tion to take place in network routers as packets are 
routed to their destination. This approach could po- 
tentially provide fault transparency and load balanc- 
ing functionality inside of the routers. We believe 
Active Networks, if widely deployed, can provide a 
mechanism for implementing Smart Client function- 
ality. 

All of the above solutions provide a level of transparent
access to network services with respect to load
balancing and fault transparency. However, they are 
all limited by the fact that they are divorced from 
the characteristics and implementations of individ- 
ual services. We observe that the greatest function- 
ality and flexibility can often be provided by adding 
service-specific customization to the client, rather 
than service-independent functionality on the server. 


3 Smart Client Architecture 


In this section, we describe how the Smart Client ar- 
chitecture allows for the construction of scalable ser- 
vices. For the purposes of this paper, we assume the 
service is implemented by a number of peer servers, 
each capable of handling individual client requests¹.
The key idea behind Smart Clients is the migration
of certain server functionality and state to the client
machine. This approach provides a number of advantages:
(i) offloading server load and decreasing implementation
complexity, (ii) allowing clients to utilize
multiple peer servers distributed across the wide area
without the knowledge of individual servers, and (iii)
improving the load distribution and fault transparency
of the service as a whole.

¹ This assumption holds for many widely-used Internet services
such as HTTP, FTP, and Web searching services.

Figure 2: This figure describes the Smart Client service access model. Two service-specific Java applets are supplied
to mediate server access. The client interface applet provides the interface to the user and makes requests
of the service. The director applet is responsible for providing transparency to the client applet; it makes server
requests to the appropriate (e.g. least loaded) server, and updates its notion of server state.


When a user wishes to use a service, a bootstrapping
mechanism is used to retrieve service-specific
applets designed to access the service. Two cooperating
applets, a client interface applet and a director
applet, provide the interface and mask the details
of contacting individual servers, respectively. Client-side
functionality is partitioned in this fashion to separate
the service’s interface design from the mechanisms
necessary to deliver client requests to servers in
a load-balanced, fault-tolerant manner.


The client interface applet is responsible for accepting
user input and packaging these requests to
the director applet. The director applet encapsulates 
knowledge of the service member set and the service- 
specific meta-information allowing the director ap- 
plet to send requests to the appropriate server. For 
every user request, the Smart Client uses the direc- 
tor applet to invoke a service-specific mechanism for 
determining the correct destination server for the re- 
quest. Figure 3 shows the interaction of the two ap- 
plets in a Java-enabled Web browser. A number of is- 
sues are associated with this approach: naming mech- 
anisms for choosing among machines implementing 
a service, procedures for receiving updates with new 
information about a service (e.g., changes in load, or 
the availability of a new machine), and bootstrapping
retrieval of the Smart Client applets. We will discuss
each of these issues in turn, leading to a description of
the Smart Clients API.


3.1 Transparent Service Access


3.1.1 Load Balancing and Fault Transparency


We begin our discussion of the Smart Client architec- 
ture by describing the techniques used to provide load 
balanced and fault tolerant access to network services. 
Discussion of bootstrapping the retrieval of the Smart 
Client is deferred until Section 3.2. We assume that 
services accessed by Smart Clients are implemented 
by a number of peer servers. In other words, any
of a list of machines is capable of serving individual
client requests. Thus, the director applet makes
a service-specific choice of a physical host to contact 
based on an internal list of (dynamically changing) 
server sites. Ideally, this choice should balance load 
among servers while maximizing performance to the 
end user. 

While the choice of load balancing algorithm is service
specific, we enumerate a number of sample techniques.
The simplest approach is to randomly pick
among service providers. While this approach is simple
to implement and does not require server modifications,
it can result in both poor load balancing
and poor performance to the end user. For example,
an FTP applet picking randomly among a list of service
providers may pick an under-powered mirror site
on another continent. A refinement on random load
balancing would bias future random choices based on
how quickly requests to a particular server are processed.
For services where multiple successive requests
are likely, we believe this technique should result
in good performance while maintaining implementation
simplicity.

Figure 3: This figure describes the Smart Client architecture. (1) The client interface applet first makes a request
to the director applet. (2) The director applet, given outside information such as load and changes to the service
group membership, picks the best node to apply the request to. The director will also re-apply the request if the
operation fails. (3) The result of the operation, including a success/failure code, is returned to the client interface
applet.
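
The biased random refinement described above could be sketched roughly as follows. This is an illustration only, not the paper's implementation: the class name BiasedRandomPicker, the neutral starting estimate, the smoothing factor, and the choice to weight each server by the inverse of its smoothed response time are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper (not part of the Smart Clients API): picks servers at
// random, biased toward those that answered recent requests quickly.
class BiasedRandomPicker {
    private final List<String> hosts = new ArrayList<>();
    private final Map<String, Double> avgMillis = new HashMap<>();
    private final Random random;

    BiasedRandomPicker(List<String> initialHosts, Random random) {
        this.random = random;
        for (String h : initialHosts) {
            hosts.add(h);
            avgMillis.put(h, 100.0); // assumed neutral starting estimate
        }
    }

    // Fold an observed response time into the running average for a known host.
    void record(String host, double millis) {
        double old = avgMillis.get(host);
        avgMillis.put(host, 0.5 * old + 0.5 * millis); // assumed smoothing factor
    }

    // Weight of a host: inverse of its smoothed response time.
    double weight(String host) {
        return 1.0 / Math.max(avgMillis.get(host), 1.0);
    }

    // Draw a host with probability proportional to its weight.
    String pick() {
        double total = 0.0;
        for (String h : hosts) total += weight(h);
        double r = random.nextDouble() * total;
        for (String h : hosts) {
            r -= weight(h);
            if (r <= 0.0) return h;
        }
        return hosts.get(hosts.size() - 1); // guard against rounding error
    }
}
```

A director built this way would call record() after each completed request and pick() before the next one, so slow servers are still tried occasionally and can regain weight if they recover.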

Another technique involves maintaining service- 
specific profiles of servers. In the FTP example 
above, a description of hardware performance and 
network connectivity (perhaps using techniques similar
to the Internet Weather Report [Mat 1996]) may be
associated with each server. The Smart Client director 
applet can then use this information to evaluate avail- 
able bandwidth to each server based on the client's 
location. A further improvement requires maintain- 
ing load information for each server. In this case, the 
client is able to maximize performance by weighing 
a combination of network connectivity, server perfor- 
mance, and current server load. 
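
One way to weigh connectivity, server performance, and load together is sketched below. The model is an assumption chosen for simplicity (a server splits its capacity evenly among its clients, and the network path caps the result); the class ServerProfile, its fields, and its units are hypothetical and do not come from the paper.

```java
import java.util.List;

// Hypothetical per-server profile; field names and units are assumptions.
class ServerProfile {
    final String host;
    final double pathKBps;    // estimated client-to-server network bandwidth
    final double serverKBps;  // what the server can deliver when idle
    final int activeClients;  // current load, e.g. open connections

    ServerProfile(String host, double pathKBps, double serverKBps, int activeClients) {
        this.host = host;
        this.pathKBps = pathKBps;
        this.serverKBps = serverKBps;
        this.activeClients = activeClients;
    }

    // Assumed model: the server shares its capacity evenly among clients
    // (including this one), and the network path caps that share.
    double expectedKBps() {
        double share = serverKBps / (activeClients + 1);
        return Math.min(pathKBps, share);
    }

    // Return the profile with the highest expected throughput, or null if empty.
    static ServerProfile best(List<ServerProfile> profiles) {
        ServerProfile best = null;
        for (ServerProfile p : profiles) {
            if (best == null || p.expectedKBps() > best.expectedKBps()) best = p;
        }
        return best;
    }
}
```

A real director would likely need a service-specific model; the point of the sketch is only that a nearby but busy mirror can lose to a lightly loaded one once load is taken into account.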


The mechanisms used for load balancing can be 
adapted to provide fault transparency to the end user. 
Techniques such as keep-alive messages or timeouts
can be utilized to determine server failure. Upon failure,
the director applet can reinvoke the load balancing
mechanism to choose an alternate server and reapply
the request. By storing all uncompleted server requests
and necessary client state information, the director
applet can connect to an alternative site to retransmit
all outstanding requests transparently to the
user.


3.1.2 Updating Applet State 


In order to make load balancing decisions, the client 
may need a reasonably current profile of the individ- 
ual servers providing the service. Depending on the 
application, informing the director of changes in service
state can be achieved through either lazy or eager
techniques, presenting both performance and seman- 
tic tradeoffs for maintaining consistency. 

Examples of eager update techniques include client 
polling and server callbacks. Using client polling of 
servers to maintain load information has the disad- 
vantage of severely loading server machines. Server 
callback techniques can be more scalable than client 
polling; however, they require server modifications
and increase implementation complexity. Neither 
client polling nor server callbacks is likely to scale to 
the level of thousands of clients necessary for some 
Web services. Eager update methods are appropriate 
when accurate information is required and the scale of 
the service is small enough to support eager protocols. 

Lazy update techniques [Ladin et al. 1992] are 
likely to be more appropriate in the context of the 
Web. Lazy updates reduce network traffic by sending 
information only occasionally, after a number of updates
have been collected. One particularly attractive
mode of lazy updates is piggy-backing update information
with server replies to client requests. For example,
a server can inform Smart Client director applets
of the addition of new server machines comprising
the service when replying to a director request.
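
The piggy-backing idea can be illustrated with a small sketch. The reply format is not specified in the text, so the Reply class, its fields, and the MembershipView helper below are hypothetical names invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical reply: the normal payload plus optional membership changes
// that the server attaches at no extra round-trip cost.
class Reply {
    final String payload;
    final List<String> addedNodes;
    final List<String> removedNodes;

    Reply(String payload, List<String> addedNodes, List<String> removedNodes) {
        this.payload = payload;
        this.addedNodes = addedNodes;
        this.removedNodes = removedNodes;
    }
}

// Hypothetical director-side view of the service group.
class MembershipView {
    private final Set<String> nodes = new LinkedHashSet<>();

    MembershipView(List<String> initialHint) {
        nodes.addAll(initialHint);
    }

    // Apply whatever membership changes rode along with an ordinary reply,
    // then hand the payload back to the caller.
    String handle(Reply reply) {
        nodes.addAll(reply.addedNodes);
        nodes.removeAll(reply.removedNodes);
        return reply.payload;
    }

    List<String> current() {
        return new ArrayList<>(nodes);
    }
}
```

Because updates arrive only with replies, an idle client's view can go stale; that is the consistency tradeoff of the lazy approach noted above.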


3.1.3 Director Architecture 


Smart Clients provide a very flexible mechanism 
for implementing service-specific transparency. The 
Smart Client director provides the illusion of a single, 
highly-available machine to the programmer of the 
client interface applet. Requests made by the Smart 
Client client interface applet are written to operate on 
a single machine. The director applet chooses the des- 
tination server based on service-specific information 
such as load, availability, processor speed, or connec- 
tion speed. 

The director accepts arbitrary requests of the form 
“perform this action on a server node”. The direc- 
tor applet sends the request to the server determined 
to deliver the best performance to the client. If the 
request fails, the next server in the director applet’s 
ranked list is contacted with the request. In this way, 
the director applet provides transparent access to arbitrary
server groups. As a result of the well-defined interface
between the client interface applet and the director
applet (as described in Section 3.3), individual
director applets are easily interchanged for many different
services. For example, the director applet providing
transparent Telnet access to a cluster of workstations
can also be used for services such as chat or
FTP.


3.2 Bootstrapping Applet Retrieval 


The goal of transparent access to network services 
would be compromised if the Smart Client applets 
necessary for service access must be downloaded 
from a single hostname before every service access. 
We have created a scalable bootstrapping mechanism 
to circumvent this single point of failure. To remove 
the single point of failure associated with a single 
hostname, we have modified jfox [Wendt 1996], an 
existing Java Web browser, to support a new service
name space, e.g. service://now chat service. For the
service name space, the browser contacts one of many
well-known search engines with a query. These well- 
known search engines serve the same purpose as the 
root name servers in DNS. 

Currently, the browser contacts Altavista [Dig 
1995] with a query requesting an HTML page whose 
title matches the service name, e.g. “now chat service”.
In this way, Smart Clients leverages highly-available
search engines to provide translations from
well-known service names to a URL. The URL points
to a page containing a service certificate. The cer- 
tificate includes references to both the client interface 
and director applets. In addition, the certificate con- 
tains some initial guess as to service group member- 
ship. This hint initializes the director applet, allowing 
the applet to validate the list by contacting one of the 
nodes. Figure 4 shows a certificate used for the NOW 
chat service. 

Jfox has been extended to cache the Smart Client 
applets associated with individual services, the lo- 
cation of the service certificate, the certificate itself 
and any additional state that the Smart Client direc- 
tor needs for the next access to the service. While 
the client interface applet and service certificate are
cached using normal browser disk caching mechanisms,
the director state is saved by serializing the director
applet (and any relevant instance variables) to
disk using Java Object serialization [Jav 1996]. Thus,
on subsequent service accesses, the director applet 
need not rely on the initial group membership con- 
tained in the service certificate. Instead, it can use 
the last known service group membership. With this 
bootstrapping mechanism, no network communica- 
tion is necessary to load the service applets after the 
initial access. 
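
Saving and restoring director state with Java object serialization might look roughly like the sketch below, written in present-day Java rather than the 1996-era JDK. DirectorState and DirectorCache are hypothetical names, and the single knownNodes field is an assumption; the fields of the actual director are not documented here. Only the standard java.io serialization classes are used.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical director state: just the last known service group membership.
class DirectorState implements Serializable {
    private static final long serialVersionUID = 1L;
    final List<String> knownNodes = new ArrayList<>();
}

// Hypothetical cache helper that writes the state to disk and reads it back.
class DirectorCache {
    static void save(DirectorState state, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(state);
        }
    }

    static DirectorState load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (DirectorState) in.readObject();
        }
    }
}
```

On a later access, the browser would call load() first and fall back to the membership hint in the service certificate only if no cached state exists.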

Currently, the service certificate and applets are 
cached indefinitely. In the future, we plan on adding
a timeout period to the service certificate. After the
timeout, the browser can revalidate both its service
certificate and the associated applets. If either the certificate
or applets are inaccessible, the decision to proceed
with the cached state can be made on a service-specific
basis.

Note that with the exception of bootstrapping,
the implemented applets work on unmodified Java- 
capable Web browsers such as Netscape Navigator 
and Internet Explorer. Further, mainstream browsers 
such as Internet Explorer allow for installation of fil- 
ters over the entire browser [Leach 1996]. Such a fil- 
ter would allow our bootstrapping mechanism to be 
implemented in widely used Web browsers. 

The bootstrapping problem has been addressed in 
other contexts. For example, distributed applications
need access to DNS without a name server. Such
applications fall back to sending queries to well-known
root name servers when they are unable to resolve a
hostname. As another example, applications which communicate
through RPCs must bind to a server without
using an RPC. This problem is also addressed by us- 
ing broadcast to initiate binding to RPC servers on the 
network.

<HTML>
<TITLE>now chat service</TITLE>
<META name="description" content="now chat service">
<META name="keywords" content="now chat service">
<APPLET name="now_chat"
        codebase="Chat" code="Chat.class">
  <param name="director" value="now_chat_director">
</APPLET>
<APPLET name="now_chat_director"
        codebase="Chat" code="ChatDirector.class">
  <param name="nodes"
         value="u81.cs.berkeley.edu, u82.cs.berkeley.edu, u83.cs.berkeley.edu">
</APPLET>
</HTML>


Figure 4: This example of a service certificate references both the client interface applet (Chat.class) and the director
applet (ChatDirector.class). Initial service group membership (u81, u82, and u83) is fed to the director applet for
bootstrapping purposes. The director applet contacts one of these machines to obtain group membership updates.


3.3 Smart Clients API


In this subsection, we will describe the Smart Clients 
API. The goal of the API is to provide a generic in- 
terface for service providers to develop transparent 
access to their servers and to make it easier for pro- 
grammers to implement applications for distributed 
services. In the interests of brevity, we do not doc- 
ument the interface in its entirety. Interested readers 
can download the Java classes implementing the API 
to see how the classes are used to implement a number
of sample applications (as described in Section 4). 
Figure 5 presents a high-level overview of the Java 
methods which make up the Smart Clients API. The 
IDirector interface provides a simple abstraction 
of a service to the application programmer. The pro- 
grammer makes director requests through the IDirec- 
tor interface. The requests are then sent by the direc- 
tor applet to one of the service nodes; note that the ap- 
plication programmer is not concerned with managing 
server nodes. If the request fails, a director exception 
is raised. In response, the director will first allow the
request to clean up any state, then resend the request to
another server. The director applet takes a best effort 
approach in delivering the request. Thus, a return of 
false from the delivery request indicates a catastrophic 
failure of the service, i.e. all servers have failed. 
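
The retry behavior described above can be sketched with a very simple director. The IRequest, IDirector, and RequestException declarations are repeated here (without the public modifier) only to keep the example self-contained; RankedListDirector, with its fixed ranked host list, is a hypothetical implementation and not one of the directors shipped with the API.

```java
import java.util.ArrayList;
import java.util.List;

class RequestException extends Exception {
    RequestException(String message) { super(message); }
}

interface IRequest {
    // Perform the action on 'hostname'; throw RequestException on error.
    void action(String hostname) throws RequestException;
    // Clean up after a failed attempt on 'oldhostname'.
    void cleanup(String oldhostname, RequestException t);
}

interface IDirector {
    // Returns false only if no remaining node could serve the request.
    boolean apply(IRequest r);
}

// Hypothetical director: tries hosts in a fixed, ranked order.
class RankedListDirector implements IDirector {
    private final List<String> rankedHosts;

    RankedListDirector(List<String> rankedHosts) {
        this.rankedHosts = new ArrayList<>(rankedHosts);
    }

    public boolean apply(IRequest r) {
        for (String host : rankedHosts) {
            try {
                r.action(host);
                return true;          // delivered; stop trying further nodes
            } catch (RequestException e) {
                r.cleanup(host, e);   // let the request clean up its state first
            }
        }
        return false;                 // every node failed
    }
}
```

The application programmer only writes the IRequest; which node is tried, and in what order, stays inside the director.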


4 Sample Applications 


4.1 Telnet Front-End for a NOW 


The NOW (Network of Workstations) Project [Anderson
et al. 1995a] at UC Berkeley provides approximately
100 workstations for use within the department;
however, it is difficult for users to know which
of the 100 machines is least loaded. To address this 
problem, we developed a Web page containing a sin- 
gle button which, when pressed, opens a telnet win- 
dow onto the least loaded machine in the NOW clus- 
ter. 

The implementation of the telnet application is 
straightforward. The telnet Web page encapsulates 
the necessary Smart Clients applets. The director applet
periodically polls the NOW’s operating system,
GLUnix [Ghormley et al. 1995], to retrieve the load
averages of machines in the cluster through a simple
Common Gateway Interface (CGI) program. When
the user clicks on the telnet button (provided by the
client interface applet), a request is sent to start a telnet
window on the least loaded machine in the cluster.
If the director applet notices that a machine has failed,
it will not submit telnet requests to that node. We
are currently investigating a fault-tolerant telnet service
which re-opens a telnet window (with saved state
such as the current working directory and environment
variables) in the event of node failure. The fault-tolerant
telnet would pass this saved state through the
RequestException object (as described in Figure 5).
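
The node selection performed by the telnet director could be sketched as below. The polling of GLUnix through the CGI program is left out; LoadTable and its methods are hypothetical names, and representing a failed machine by dropping it from the table is an assumption of the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical table of the most recent load average reported for each host.
class LoadTable {
    private final Map<String, Double> loadAvg = new LinkedHashMap<>();

    // Record the load average returned by the latest poll for this host.
    void report(String host, double load) {
        loadAvg.put(host, load);
    }

    // A host found to have failed is no longer considered.
    void markFailed(String host) {
        loadAvg.remove(host);
    }

    // Host with the smallest known load average, or null if none is known.
    String leastLoaded() {
        String best = null;
        double bestLoad = Double.MAX_VALUE;
        for (Map.Entry<String, Double> e : loadAvg.entrySet()) {
            if (e.getValue() < bestLoad) {
                best = e.getKey();
                bestLoad = e.getValue();
            }
        }
        return best;
    }
}
```

When the user presses the telnet button, the director would open the session on whatever leastLoaded() returns at that moment.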


4.2 Scalable FTP Interface 


We have also used Smart Clients to build a scalable
front-end for FTP sites. As a motivating example,
the Netscape Navigator FTP download page² contains
twelve hyperlinks for Netscape FTP hosts. Users
choose among Netscape sites or mirrors to perform
manual load balancing. To improve on this interface,
Smart Client applets present a single download button
to the user. The client interface applet delivers requests
to the director to retrieve a file, while the director
picks a machine at random from a static set of
nodes. When the user presses the button, the applet
transparently determines the best site for file retrieval.

² http://www.netscape.com/comprod/mirror/client.download.html

// Interface to encapsulate all client interface applet requests
public interface IRequest {
    // Downcall from the director to the request object. Perform the action
    // on 'hostname'. Throw RequestException if an error occurs.
    public void action(String hostname) throws RequestException;

    // Downcall from the director to the request object upon failure. Perform
    // any necessary cleanup code. The state of the failed request consists
    // of the 'oldhostname' and the RequestException that was thrown from the
    // action method.
    public void cleanup(String oldhostname, RequestException t);
}

// Generic interface for all director applets
public interface IDirector {
    // Execute the request r on a hostname of the director's choosing. If the
    // request object throws a RequestException, assume failure of the node
    // and reapply the request after calling the request's cleanup method.
    // If there are no remaining nodes, return false. Otherwise, return true.
    public boolean apply(IRequest r);
}

Figure 5: This figure describes some of the interfaces in the Smart Clients API. Classes that implement the director
interface have been written to provide much of the functionality necessary to simple directors, including
randomized directors (picking a random machine) and directors based on choosing the least loaded server.


To demonstrate the scalability available from us- 
ing Smart Clients, we measure delivered bandwidth 
to Smart Client applets running in Netscape Navi- 
gator from a varying number of FTP servers. We 
emphasize that the choice of FTP site is transparent
to the end user (a single button is pressed to begin
file retrieval) and that our FTP application can be
downloaded to run with unmodified servers and Java-compliant
browsers. The tests were run on a cluster
of Sun SPARCstation 10s and 20s interconnected by a
10 Mbps Ethernet switch. The Ethernet switch allows
each machine in the cluster to simultaneously deliver
10 Mbps of aggregate bandwidth to the rest of the cluster
without the contention associated with shared
Ethernet networks. Either one, two, or four of the machines
are designated FTP servers, while the rest of
the machines in the cluster attempt 40 consecutive re- 
trievals of a 512 KB file. This experimental setup ap- 
proximates multiple FTP mirror sites spread across 
the wide area. 


Figure 6 summarizes the results of the FTP scalability
tests. The graph shows aggregate delivered
bandwidth in megabytes per second as a function of
the number of client machines making simultaneous 
file requests. For one FTP server, 8 clients are able 
to saturate the single available Ethernet link at 1.2 
MB/s. The results for two and four FTP servers 
demonstrate that the random selection of an FTP 
server used within the applet delivers reasonable scal- 
ability. Sixteen clients are able to retrieve approxi- 
mately 2 MB/s from two servers, while 16 clients sat- 
urate four servers at approximately 3 MB/s. 
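
As a rough sanity check on these figures, the per-link ceiling can be worked out directly. The calculation assumes 1 MB = 10^6 bytes and ignores Ethernet, IP, and TCP framing overhead; neither assumption is stated in the text.

```latex
\begin{align*}
\text{per-link ceiling} &= \frac{10\ \text{Mbit/s}}{8\ \text{bit/byte}} = 1.25\ \text{MB/s} \\
\text{one server:} \quad \frac{1.2}{1 \times 1.25} &= 0.96 \\
\text{two servers:} \quad \frac{2}{2 \times 1.25} &\approx 0.80 \\
\text{four servers:} \quad \frac{3}{4 \times 1.25} &\approx 0.60
\end{align*}
```

Under these assumptions the single-server case runs close to the link limit, while random selection reaches roughly 80% and 60% of the aggregate ceiling for two and four servers respectively.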

For a small number of clients, a single FTP server
demonstrates the best performance because all 40 file 
downloads are made during a single connection. For 
the multi-server tests, multiple connections and dis- 
connections take place as the clients attempt to ran- 
domly balance load across the servers. In the fu- 
ture, this problem can be avoided by implementing 
site affinity with successive file requests (if delivered
bandwidth on the previous request was deemed satisfactory),
implementing a load daemon on the nodes to allow
the clients to continuously choose lightly loaded
machines, or by using services such as the Internet
Weather Report [Mat 1996] to choose low-latency hosts.
This information can be used to incrementally scale 
connections to available FTP servers (i.e. allow some 
machines to be recruited only when needed). 


³ We were unable to take measurements for more than 16 simultaneous
clients making requests to a single server because the FTP
server would not allow more than 16 simultaneous file retrievals.
We plan to investigate this limitation further.

Figure 6: This figure demonstrates how a Smart Client interface to FTP delivers scalable performance. The graph
shows delivered aggregate bandwidth as a function of number of clients making simultaneous requests.


4.3 Scalable Chat 


The next application we implement is Internet chat. 
The application allows for individuals to enter and 
leave chat rooms to converse with others co-located 
in the same logical room. The chat application is im- 
plemented as Java applets run through a Web browser. 
Figure 7 depicts our implementation of the application.
Individual chat rooms are modeled as files exported
through WebFS [Vahdat et al. 1996], a file system
allowing global URL read/write access. WebFS
provides for negotiation of various cache consistency 
protocols on file open. 

We extended WebFS to implement a scalable 
caching policy suitable to the chat application. In 
this model, when a user wishes to enter a chat room, 
the client simply opens a well-known file associated 
with the room. This operation registers the client 
with WebFS. Read and write operations on the file 
correspond to receiving messages from other chatters 
and sending a message out to the room, respectively. 
On receiving a file update (new message), WebFS 
sends the update to all clients which had opened 
the file for reading (i.e., all chatters in a room). In
this case, the client interface applet consists of two
threads, a read thread continuously polling the chat
file and an event thread writing user input to the chat
file. These read/write requests are sent to the chat
server via the director applet.


The director sends the request to the hostname that
represents the best service node at the time. If the request
does not complete, the request raises an exception
to the director applet. The director applet then
calls the service-specific cleanup routine for the request,
and resends it to another service node. Note
that the request takes a service-specific failure event,
such as chat file not found or WebFS server is down,
and translates it into a general exception. Thus, the director
applet can be written for a cluster of machines
and reused for many different protocols: FTP, Telnet,
and chat.


From the above discussion, it is clear that a sin- 
gle WebFS server can quickly become a performance 
bottleneck as the number of simultaneous users is 
scaled. To provide system scalability, we allow multi- 
ple WebFS servers to handle client requests for a sin- 
gle file. Each server keeps a local copy of the chat file. 
Upon receiving a client update, WebFS distributes the
updates to each of the chat clients connected to it.
WebFS also accumulates updates, and every 300 ms
propagates the updates to other servers in the WebFS
group. This caching model allows for out-of-order
message delivery, but we deemed such semantics to
be acceptable for a chat application. If it is determined
that such semantics are insufficient, well-known distribution
techniques [Ladin et al. 1992, Birman 1993]
can be used to provide strong ordering of updates.

[Figure 7 diagram: Smart Clients issue write(http://server1/chat, "Hello!") and read(http://server1/chat, &x) against WebFS Server1, and read(http://server2/chat, &x) against WebFS Server2, within the chat server pool.]

Figure 7: Implementation of the chat application. Chat rooms are modeled as files with reads corresponding to
receiving conversation updates and writes to sending out a message. On a write, the WebFS updates all its clients;
the updates are propagated to other servers in a lazy fashion.
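
The lazy propagation between servers, and the out-of-order delivery it permits, can be illustrated with the following sketch. It is not WebFS code: ChatReplica and its methods are hypothetical, and the periodic timer is replaced by an explicit flush() call so the example stays deterministic (in a running server, something like a scheduled task would call flush() every 300 ms).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one server holding a local copy of the chat file.
class ChatReplica {
    final List<String> transcript = new ArrayList<>();       // local chat file
    private final List<String> pending = new ArrayList<>();  // not yet sent to peers
    private final List<ChatReplica> peers = new ArrayList<>();

    void addPeer(ChatReplica peer) {
        peers.add(peer);
    }

    // A client connected to this replica wrote a message: apply it locally
    // at once and remember it for the next batch sent to the other servers.
    synchronized void clientWrite(String message) {
        transcript.add(message);
        pending.add(message);
    }

    // A batch arriving from another replica is applied but not re-forwarded.
    synchronized void peerUpdate(List<String> batch) {
        transcript.addAll(batch);
    }

    // Send accumulated updates to every peer; meant to run periodically.
    void flush() {
        List<String> batch;
        synchronized (this) {
            batch = new ArrayList<>(pending);
            pending.clear();
        }
        if (batch.isEmpty()) return;
        for (ChatReplica peer : peers) peer.peerUpdate(batch);
    }
}
```

If two replicas each accept a write before either flushes, their transcripts end up holding the same messages in different orders, which is the relaxed semantics the text accepts for chat.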

Since the read requests are idempotent, and the 
write requests are atomic with respect to WebFS, 
the chat application is completely tolerant to server 
crashes. This fault transparency provides the illusion 
of a single, highly-available chat server machine to 
the programmer of the Chat client interface applet. 
Figure 8 demonstrates the behavior of the chat application
in the face of a failure of the client’s primary
server. The graph plots response time as a function of 
elapsed time. The graph shows that chat delivers less 
than 5 ms latency to the end user. On detecting a fail- 
ure, the latency jumps to 1 second before switching to 
a secondary WebFS server, at which point the latency 
returns to normal.
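
The failover behavior described above can be sketched as a simple retry loop. This is only an illustration under stated assumptions: `ServerDown`, `read_with_failover`, and the `fetch` callback are hypothetical names, and the sketch relies on reads being idempotent, as the text notes.

```python
class ServerDown(Exception):
    """Raised by the (hypothetical) transport when a server does not answer."""


def read_with_failover(servers, fetch):
    """Try each server in order; return (server, result) from the first
    one that answers. Retrying is safe only because reads are idempotent."""
    last_error = None
    for server in servers:
        try:
            return server, fetch(server)
        except ServerDown as err:
            last_error = err   # remember the failure and try the next peer
    raise ServerDown("no server in the group answered") from last_error
```

In a real client the failure would typically be detected by a timeout, which would account for a one-time latency spike before the switch to the secondary server.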


5 Summary 


We have described a solution to the problem of scal- 
ability and high-availability which logically migrates 
server functionality into the client. We will now re-
visit the goals set forth in Section 1 and examine how
Smart Clients addresses each goal: 


• Incremental Scalability - When a machine is
added to or removed from a service group, the di-
rector applet supplied by the service updates its
list of peer servers. The director applet discov-
ers such modifications through a service-specific
mechanism, e.g. keep-alive messages, connect-
ing to a well-known port, or refetching the ser-
vice certificate.


• Load Balancing - The director applet maintains
a service-specific notion of load (such as number 
of processes, number of open connections, avail- 
able bandwidth). Using this information, client 
requests are routed to the best candidate node. 


• Fault Transparency - When a failure occurs, the
director applet allows the client to clean up any
stale state before resending the request to another
server. Providing fault transparency requires ser-
vice support when the request is non-idempotent.
For example, in the chat application, the chat ser-
vice provides atomic writes to the chat transcript.


• Wide Area Service Topology - Smart Clients
does not place any restriction on topology of 
server machines. In fact, the director applet can 
choose an arbitrary site based on considerations 
such as proximity or predicted performance. 


• Scalable Services To Legacy Servers - Existing
servers that replicate a read-only database can 
be grouped together for access by Smart Clients. 
With knowledge of the group set, the director ap- 
plet can load balance client requests among ex- 
isting unmodified servers. 
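
To make the load balancing goal in the list above concrete, here is a minimal Python sketch of how a director might pick a target from service-specific load figures. The function name `pick_server` and the convention of marking unreachable servers with `None` are assumptions made for illustration, not part of the Smart Clients API.

```python
def pick_server(loads):
    """Return the server with the smallest reported load.

    `loads` maps a server name to a service-specific load figure, or to
    None when the server is believed to be down.
    """
    alive = {name: load for name, load in loads.items() if load is not None}
    if not alive:
        raise LookupError("no live server in the group")
    # Comparing (load, name) pairs makes ties deterministic.
    return min(alive, key=lambda name: (alive[name], name))
```

The load metric is left opaque on purpose: the text lists number of processes, open connections, and available bandwidth as possible choices.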


Finally, we believe that the architecture presented 
in this paper can simplify implementation of scalable 
services with respect to at least fault transparency and 
load balancing. The Smart Client director provides 
the illusion of a single, highly available server. This
model substantially decreases the complexity of the
client interface applet since this applet need not be
concerned with maintaining the set of server nodes. In
addition, because of the public interface between the
client interface and director applets, each can be writ-
ten once and interchanged for a number of different
services.

Figure 8: Chat response times in the face of server load. The chat application delivers latencies of approximately
10 ms under normal circumstances. On server failure, the application takes one second to switch to a peer server.


6 Related Work 


The problem of transparently providing fault trans- 
parency and load balancing to network services has 
been addressed previously in a number of contexts. 
File systems have used server-side replication of vol- 
umes and servers to provide fault transparency in 
systems such as Deceit [Marzullo et al. 1990],
AFS [Howard et al. 1988], and HA-NFS [Bhide et al.
1991]. More recently, systems such as xFS [Ander-
son et al. 1995b] and Petal [Lee & Thekkath 1996]
use client-side techniques to improve overall file sys-
tem performance. Many distributed clusters perform
load balancing on the level of jobs (interactive or 
otherwise) submitted to the system [Nichols 1987, 
Bricker et al. 1991, Douglis & Ousterhout 1991, Zhou
et al. 1992]. Once again, all these systems implement 
server-side solutions for load balancing and require 
client intervention to spread jobs among cluster ma-
chines.

Perhaps most closely related to Smart Clients are 
Transaction Processing monitors [Gray & Reuter
1993] (TP monitors). TP monitors provide function-
ality similar to Smart Clients for access to databases.
The TP monitor functions as the director for transac- 
tions to resource managers, accounting for load on 
machines, the RPC program number, and any affin- 
ity between client and server. Resource managers 
are usually SQL databases, but can be any server 
that supports transactions. TP monitors differ from 
Smart Clients in that they deal exclusively with trans- 
actional RPCs as the communication mechanism to 
the servers. TP monitors are also more closely cou- 
pled with the server nodes since they are responsible 
for starting new server processes. 


The Smart Client director can be tailored to each 
service, while the TP monitor is more of a general pur-
pose director. Smart Clients also provide a bootstrap-
ping mechanism to remove the single point of failure
associated with downloading the necessary routing 
software. In addition, the Smart Client code is signif- 
icantly more lightweight than the TP monitor which 
often includes many of the features of traditional op-
erating systems: process management/creation, au-
thentication, and linking resource manager object
code with the Transaction Processing operating sys- 
tem (TPOS). This lightweight nature enables Smart 
Clients to be downloaded into existing Web browsers 
to customize existing Internet services. 


Also related to our systems are ISIS [Birman 1993] 
and gossip architectures [Ladin et al. 1992] which 
provide a substrate for developing distributed applica-
tions. ISIS provides reliable group communication to
support many of the applications we envision. Gos- 
sip architectures use front-ends analogous to Smart 
Clients to access replicated servers which are kept 
consistent through lazy updates. Both systems are or- 
thogonal to our work in many respects and still use 
server-side techniques for much of their functionality. 


7 Conclusions 


In this paper, we have shown that existing solutions 
to providing transparent access to network services 
suffer from a lack of knowledge about the semantics 
of individual services. The recent advent of Java al- 
lowing distribution of portable client code presents 
an opportunity to migrate certain service functional- 
ity onto the client machine. We show that such mi- 
gration can simplify service implementation and im- 
prove the quality of service to users. To this end, 
we describe our implementation of Smart Clients to 
show the greater flexibility available from a client- 
side approach to building scalable services. The 
Smart Clients API provides a generic interface for ac- 
cessing network services. Further, the decomposition 
of the API into individual client interface and direc- 
tor applets allows interchanging of these applets for 
a variety of services. The Smart Clients API is spe- 
cialized to provide scalable access to three sample ser- 
vices: telnet, FTP, and Internet chat. 

In the future, we will further explore service- 
specific load balancing techniques for achieving scal- 
ability. We also plan to demonstrate how Smart 
Clients can be used to provide load balancing and 
fault transparency for services replicated across the 
wide area. We also plan to implement an interface 
for transparent access to HTTP servers and a fault-
tolerant telnet client. Migration of other code, be-
sides the director, from the server to the client will also
be explored. 


Abstract


The Solaris MC distributed operating system provides a 
single-system image across a cluster of nodes, including 
distributed process management. It supports remote 
signals, waits across nodes, remote execution, and a 
distributed /proc pseudo file system. Process manage- 
ment in Solaris MC is implemented through an object- 
oriented interface to the process system. This paper has 
three main goals: it illustrates how an existing UNIX 
operating system kernel can be extended to provide 
distributed process support, it provides interfaces that 
may be useful for general access to the kernel’s process 
activity, and it gives experience with object-oriented 
programming in a commercial kernel.


1 Introduction


The Solaris MC research project¹ has set out to extend
the Solaris® operating system to clusters of nodes. We 
believe that due to technology trends, the preferred 
architecture for large-scale servers will be a cluster of 
off-the-shelf multiprocessor nodes connected by an 
industry-standard interconnect. The Solaris MC oper- 
ating system is a prototype distributed operating system 
that provides a single system image for such a cluster 
and will provide high availability so that the cluster can 
survive node failures. Solaris MC is built as a set of 
extensions to the base Solaris UNIX® system and 
provides the same ABI/API as the Solaris OS, running 
unmodified applications. 


As process management is a key component of an oper- 
ating system, a single-system image cluster operating 
system must implement a variety of process-related 
operations. In addition to creating and destroying 
processes, a POSIX-compliant [13] operating system
must support signalling a process, waiting for child
process termination, managing process groups, and
handling tty sessions. In addition, Solaris provides a
file-based interface to processes through the /proc file
system, which is used by ps and debuggers. Finally,
cluster-based process management must provide addi-
tional functionality such as controlling remote execution
of processes.


1. Solaris MC is an internal name of a research project at
Sun Microsystems Laboratories. More information on
the project can be obtained from
http://www.sunlabs.com/research/solaris-mc


Our decision to base the Solaris MC operating system 
on the existing Solaris kernel influenced many of the 
architectural decisions in the process management 
subsystem. One goal was to minimize the amount of 
kernel change required, while making the changes 
necessary to provide a fully-transparent single-system 
image. Thus, the Solaris MC process subsystem can be 
distinguished on one hand from cluster process systems 
that run entirely at user level (such as GLUnix [22]) and
on the other hand from systems that build distributed 
process management from scratch (such as Sprite [18]).
By putting distributed process support in the kernel, we 
gain both better performance and more transparency 
than systems that operate at user level. By using most of 
an existing kernel, we reduce the development cost and
increase the commercial potential. This is similar in 
concept to Locus [20] and OSF/1 AD TNC [25]. 


As well as describing an implementation of cluster 
process management, this paper describes interfaces 
into the kernel’s process management. UNIX systems 
lack a good kernel interface allowing access to the low-
level process management. In comparison, the file
system has a well-defined existing interface between the 
kernel and the underlying storage system through 
vnodes [15], allowing new storage systems or distrib- 
uted file systems to be “plugged in.” The object-oriented 
interfaces described below are a first cut at providing 
similar access to processes. 


This paper describes the process management 
subsystem of the Solaris MC operating system. Section 
2 provides an overview of the implementation of distrib-
uted process management. Section 3 discusses the inter-
faces that the Solaris MC process management uses.
Section 4 gives implementation details, Section 5 gives 
performance measurements, Section 6 compares this 
system with other distributed operating systems, and 
Section 7 concludes the paper. 


2 Distributed process management 


2.1 Overview of the Solaris MC operating system


The Solaris MC operating system [14] is a prototype 
distributed operating system for a cluster of nodes that 
provides a single-system image: a cluster appears to the 
user and applications as a single computer running the 
Solaris operating system. Solaris MC is built as a set of 
extensions to the base Solaris UNIX system and provides 
the same ABI/API as Solaris, running unmodified appli- 
cations. The Solaris MC OS provides a single-system 
image through a distributed file system, clusterwide 
process management, transparent access from any node 
to one or more external network interfaces, and cluster
administration tools. High availability is currently being 
built into the Solaris MC OS: if a node fails, the 
remaining nodes will remain operational, and the file 
system and networking will transparently recover. 


The Solaris MC OS is built as a collection of distributed 
objects in C++, for several reasons. First, the object- 
oriented approach gives us a good mechanism for 
ensuring that components communicate through well- 
defined, versionable interfaces; we use the IDL interface 
definition language [21]. Second, a distributed object 
framework lets us invoke methods on objects without 
worrying if they are local or remote, simplifying design
and programming. Finally, the object framework keeps 
track of object reference counting, even in the event of 
failures, and provides subsystem failure notification, 
simplifying the design for high availability. 


We use an object communication runtime system [5] 
based on CORBA [16] that borrows from Solaris doors 
for interprocess communication and Spring subcontracts 
[10] for flexible communication semantics. Our object 
framework lets us define object interfaces using IDL 
[21]. We implement these objects in C++, and they can 
reside either at user level or in the kernel. References to 
these objects can be passed from node to node. A 
method on an object reference is invoked as if the object 
were a standard C++ object. If the object is remote, the
object framework will transparently marshal arguments,
send the request to the object’s node, and return the
reply. Thus, the object framework provides a convenient 
mechanism for implementing distributed objects and 
providing communication. 
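
The idea of invoking a method the same way on a local object and on a remote reference can be sketched in a few lines of Python. This is not the Solaris MC object framework (which is C++ with IDL-defined interfaces); `Counter`, `RemoteRef`, and `bump` are hypothetical names, and a JSON round trip stands in for marshalling over the interconnect.

```python
import json


class Counter:
    """An ordinary object that may live on this node or on another one."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n
        return self.value


class RemoteRef:
    """Stand-in for a reference to an object on another node.

    Arguments and results are round-tripped through JSON to imitate
    marshalling; a real system would send the bytes to the object's node.
    """

    def __init__(self, target):
        self._target = target

    def __getattr__(self, method):
        def call(*args):
            request = json.dumps({"method": method, "args": args})
            decoded = json.loads(request)   # the request "arrives" remotely
            result = getattr(self._target, decoded["method"])(*decoded["args"])
            return json.loads(json.dumps(result))   # marshal the reply back
        return call


def bump(counter):
    # Caller code is identical for local objects and remote references.
    return counter.add(1)
```

The caller `bump` never checks where the object lives, which is the property the text credits with simplifying design and programming.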

The Solaris MC proxy file system (PXFS) provides an
efficient, distributed file system for the cluster. The file 
system uses client and server-side caching using the 
Spring caching architecture, which provides UNIX 
consistency semantics. The file system is built on top of 
vnodes, as shown in Figure 1: on the client side, the file 
system looks like a normal vnode-based file system, and 
on the server side an existing file system (such as UFS) 
plugs into the Solaris MC file system using the vnode 
interface to provide the actual file system storage. The 
components of the Solaris MC file system communicate 
using the object system. 


2.2 Distributed process management 


The process management component of Solaris MC was 
designed to satisfy several goals: 


• Provide binary compatibility and POSIX semantics.

• Minimize changes to the kernel.

• Use an object-oriented design based on CORBA and
IDL.

• Provide good performance: The system should avoid
node-to-node communication for local process
operations, for instance.

• Provide reliability in the face of failures: The system
should be able to continue working if nodes fail or
are added to the system.


To support these goals, we implemented process 
management using the architecture shown in Figure 2. 
The process management globalization code is written 
as a C++ module that is loaded into the kernel address 
space. Conceptually, this module sits on top of the 
existing process code; system calls are directed into the
globalization layer. Much of the kernel’s existing
process management (e.g. threads, scheduling, and
process creation) was used unchanged, some kernel
operations were modified to hook into the globalization
layer (e.g. fork, exec, and signals), and a few compo-
nents of the kernel were largely replaced (e.g. wait and
process groups). The distributed /proc file system is
implemented independently, as will be discussed in
Section 4.6.

[Figure 1 diagram, two panels: (a) Standard Solaris and (b) Solaris MC; surviving labels: kernel, vnode/VFS, PXFS client caches, file system.]

Figure 1: Structure of the Solaris MC proxy file system
(PXFS). Solaris MC splits the file system across
multiple nodes by adding PXFS client and PXFS server
layers above the underlying file system. These layers
communicate through IDL-based interfaces, as do the
file system caches.


We made the design decision to have a virtual process 
object associated with each process and a process 
manager object associated with each node. The virtual 
process object handles process-specific operations and 
holds the state necessary to globalize the process, while 
the process manager object handles node-specific opera- 
tions (such as looking up a process ID or pid) and looks 
after the node-specific state. These objects communicate 
using IDL interfaces to perform operations. 


A second key design decision was to keep most kernel
process fields accurate locally, so most kernel code can 
run without modification. For instance, the local process 
ID is the same as the global process ID, rather than 
implementing a global pid space on top of a different 
local pid space. One alternative would be to do away 
with the existing kernel pid data structures and only use 
the globalized data. This approach would require exten-
sive changes to the kernel, wherever these data struc-
tures were accessed. In our approach, the kernel needed
to be modified only when the local and global pictures 
wouldn’t correspond, for instance when children are 
accessed or a process is accessed by pid. Another 
approach would be to use separate local and global 
pids; this would require mapping between them when-
ever the user supplies or receives a pid. This would
reduce the kernel changes, but the mapping step would
make all operations on pids slower and more compli-
cated.

[Figure 2 diagram, main process management panel; surviving labels: system call interface; process module (globalization layer: wait, proc groups); existing kernel (threads, scheduling, creation, fork, exec, signal).]


The main objects used in the system are shown in Figure 
3. The standard Solaris OS stores process state in a 
structure called proc_t. We added a field to this struc- 
ture to point to the virtual process object; since existing 
Solaris code passes around proc_t pointers, we need 
some way to get from this to the virtual process object. 
The virtual process object manages the global 
parent/child relationships through object references to 
the parent and children (which may be on the same or 
different nodes). It also contains global process state 
information, exit status for children, and forwarding 
information for remotely execed processes. Additional 
objects (discussed in Section 4.2) manage process 
groups and sessions. 


The process manager contains the node-specific data. In 
particular, it holds a map from pids to virtual process 
objects for all local processes and processes originally 
created on this node. (This is analogous to the home 
structure in MOSIX [3].) It also contains object refer- 
ences to the process managers on the other nodes in the 
cluster. Thus, the process manager is used to locate 
processes, to locate other nodes, and to perform opera- 
tions that aren’t associated with a particular process (e.g. 
sigsendset). 


To locate processes, we partitioned the pid space 
among the nodes, so the top digits of the pid hold the
node number of the original node of the process (similar 
to OSF/1 [25] or Sprite [18]). Note that a process may 
move from node to node through the remote exec oper- 
ation; since the pid can’t change, the pid will specify 
the original node, not necessarily the current node. 
Thus, given a pid it is straightforward to determine the
original node. The process manager on this node can be
queried for an object reference to the virtual process
object, and then the operation can be directed to that
object. Instead of updating all references when a process
changes nodes, we leave a forwarding reference in a
virtual process object on the original node. When this is
accessed, it sends a message to the requestor and the
stale pointer is updated. Because the object framework
does reference counting, the forwarding object will
automatically be eliminated when there are no longer
any references to it.

[Figure 2 diagram, /proc file system panel; surviving labels: existing /proc, other nodes.]

Figure 2: Structure of process management in the Solaris MC operating system. Process management is divided into
two independent components. The main component, which supports UNIX process operations, is implemented as a
module that interacts with the kernel to provide globalized process operations. The /proc component, which
provides a global implementation of the /proc file system, is implemented as part of the Solaris MC PXFS file
system.

[Figure 3 diagram; surviving labels: Node, Virtual proc, Other process managers.]

Figure 3: The objects for process management. For each local process, the existing Solaris proc_t structure points to
a virtual process structure that holds the parent and child structure as well as other globalizing information. The
process manager has a map from pids to the virtual processes and from the node ids to the process managers. Solid
circles indicate object references.
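
The pid partitioning and forwarding scheme described above can be illustrated with a small Python model. Everything here is a sketch under assumptions: the partition size `PIDS_PER_NODE` is invented (the paper gives no figure), the class names are hypothetical, and the sketch follows forwarding references on every lookup rather than updating the requestor's stale pointer as the real system does.

```python
PIDS_PER_NODE = 100_000   # assumed partition size, for illustration only


def home_node(pid):
    # The high-order digits of the pid name the node that created it.
    return pid // PIDS_PER_NODE


class VirtualProcess:
    def __init__(self, pid, node):
        self.pid, self.node = pid, node


class Forward:
    """Left on the original node when a process moves via remote exec."""

    def __init__(self, target):
        self.target = target


class ProcessManager:
    def __init__(self, node):
        self.node = node
        self.vprocs = {}      # pid -> VirtualProcess or Forward
        self.managers = {}    # node number -> ProcessManager

    def create(self, local_index):
        pid = self.node * PIDS_PER_NODE + local_index
        self.vprocs[pid] = VirtualProcess(pid, self.node)
        return pid

    def find(self, pid):
        # Ask the pid's home node, then follow any forwarding reference.
        entry = self.managers[home_node(pid)].vprocs[pid]
        while isinstance(entry, Forward):
            entry = entry.target
        return entry

    def remote_exec(self, pid, dest):
        # The pid is unchanged; the home node keeps a forwarding reference.
        moved = VirtualProcess(pid, dest.node)
        dest.vprocs[pid] = moved
        self.managers[home_node(pid)].vprocs[pid] = Forward(moved)
        if self.node != home_node(pid):
            del self.vprocs[pid]
        return moved


def make_cluster(n):
    """Create n process managers that all know about each other."""
    managers = {i: ProcessManager(i) for i in range(n)}
    for m in managers.values():
        m.managers = managers
    return managers
```

Because the home node is computed from the pid itself, any node can start a lookup without a directory service; only the home node's entry has to change when a process moves.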


The virtual process object in Solaris MC can be 
compared with the global process structures in OSF/1 
AD TNC [25]. In the OSF/1 system, the only process
structure visible to the base server is the simple vproc 
structure, which is analogous to a vnode structure. The 
vproc structure contains the pid, an operation vector, 
and a pointer to internal data. This pointer references a 
pvproc structure, which holds the globalization infor- 
mation (location, parent-sibling-child relationships, 
group, and session), and another operation vector which 
directs operations to local process code or a remote 
operation stub that performs an RPC.


In both systems, the virtual process layer sits between 
the application’s process requests and the underlying 
physical process code, providing globalization. An 
OSF/1 vproc structure is roughly analogous to a Solaris
MC object reference, while the pvproc holds the actual 
data and roughly corresponds to the Solaris MC virtual 
process object. Thus, OSF/1 encapsulates process data
by placing it in a separate pvproc structure, while 
Solaris MC encapsulates it inside an object. One key 
difference occurs on nodes that access a process that is 
located on a different node. In OSF/1, these nodes will
have both a vproc and pvproc structure and the associ-
ated process state, while Solaris MC will only have an
object reference. A second difference is that OSF/1 
directs operations remotely or locally through an opera- 
tion vector in the pvproc structure, while Solaris MC 
directs operations remotely or locally through the 
method table associated with the virtual process object 
reference, which is managed automatically by the object 
framework. Thus, the object framework hides the real 
location of a Solaris MC virtual process object. Finally, 
the marshalling and transport of requests between nodes 
is performed transparently by the Solaris MC object 
subsystem, while in OSF/1, the vproc system needed its 
own communication system, using the TNC message 
layer. 


3 Process interfaces


Unlike the file system with its vnode interface, the 
process control in Solaris lacks a clean interface for 
extension. Adding a file system to UNIX used to be a 
difficult task. Early attempts at distributed file systems 
(e.g. Research Version 8 [1] and EFS [6]) used ad hoc 
interfaces, while NFS [26] and RFS [2] used more formal 
interfaces, vnodes and the file system switch respec- 
tively. These interfaces required extensive restructuring 
of the kernel, but the result is that adding a new file 
system is now relatively straightforward. 


The same restructuring and kernel interface needs to be 
considered for the process layer in the kernel. The 
/proc file system provides one interface to manipulate 
processes, but it is limited in functionality to providing 
status and debugging control. The Locus vproc archi-
tecture [25] and the Solaris MC virtual process object
can be considered steps towards a general interface anal-
ogous to vnodes.


One approach would be to provide distributed process 
support through extensions to the /proc interface; we 
didn’t take this approach for several reasons. First, a file 
system based interface has a serious “impedance 
mismatch” with the operations that take place on 
processes; a simple process system call would get 
mapped to an open-write-close. Second, the overhead 
of opening files to manipulate processes is undesirable. 
Finally, a /proc-based interface would probably result 
in an assortment of ioctls more confusing than the 
current /proc ioctls; we desired a cleaner interface. 


We broke the process interface into two parts: a proce- 
dural and an internal object-oriented interface. The 
procedural interface is used for communication between 
the existing kernel and the globalization layer; this inter- 
face is described in Section 3.1. The object-oriented 
interface is used by the global process implementation, 
and is described in Section 3.2. 


3.1 Procedural interface 


The procedural interface, given in Table 1, is a thin layer 
on top of the object-oriented interface for use by the 
existing C kernel, for both the system call layer and the 
underlying kernel process code to communicate with the 
globalization layer. 


Most of the procedural calls are used to direct system 
calls into the globalization layer: VP_FORK, VP_EXEC, 
VP_VHANGUP, VP_WAIT, VP_SIGSENDSET, 
VP_SETSID, VP_PRIOCNTLSET, VP_GETSID, 
VP_SIGNAL, and VP_SETPGID. Note that operations 
such as getpid, nice, or setuid don’t go into the 
globalization layer as they operate on the current (local) 
process. 


The up-calls from the kernel to the globalization layer
are VP_FORKDONE, at the end of a fork to allow the
globalization layer to update its state; VP_EXIT, when a
process exits; VP_SIGCLDMODE, when a process changes
its SIGCLD handling flags; and VP_SIGCLD, when a
process would send a SIGCLD to its parent.


These procedural calls are implemented by having 
VP_FORK, VP_EXIT, VP_SETSID, VP_VHANGUP, 
VP_WAIT, VP_SIGCLD, and VP_SIGCLDMODE call the 
virtual process object associated with the current 
process, while the remaining operations call the node 
object on the current node. This is an optimization; since 
the former set of operations act on the current process it 
is more efficient to send the operations to the virtual 
process object directly. 
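
The routing rule described above can be written down as a small lookup. This Python sketch is illustrative only: the real layer is C calling into C++, and the `TO_NODE` set is an assumption obtained by taking "the remaining operations" literally as the rest of the calls named in Table 1.

```python
# Calls the text lists as going to the current process's virtual process object.
TO_VPROC = {"VP_FORK", "VP_EXIT", "VP_SETSID", "VP_VHANGUP",
            "VP_WAIT", "VP_SIGCLD", "VP_SIGCLDMODE"}

# Assumed: every other call from Table 1 goes to the node object.
TO_NODE = {"VP_EXEC", "VP_SIGSENDSET", "VP_PRIOCNTLSET", "VP_GETSID",
           "VP_SIGNAL", "VP_SETPGID", "VP_FORKDONE"}


def route(call):
    """Return which object a procedural call is forwarded to."""
    if call in TO_VPROC:
        return "virtual process object"
    if call in TO_NODE:
        return "node object"
    raise ValueError(f"unknown procedural call: {call}")
```

Calls in the first set act on the calling process, so no pid lookup is needed before reaching its virtual process object.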


3.2 Internal interface 


The object-oriented interface to the node object is given 
in Table 2. The node methods consist of operations to 
manage node state (addvpid, registerloadmgr, 
getprocmgr, addprocmgr, findvproc, find- 
vpid, lockpid, unlockpid), operations that create 
new processes (rfork, rexec), and operations that act 
on multiple processes (sigsendset, priocntlset). 


The interface to the process object is given in Table 3
and contains methods that act on an existing process. 
Most of these methods correspond to process system 
calls. The childstatuschange and releasechild 
methods are used for waits; they are described in more 
detail in Section 4.4. The childmigrated method 
keeps the parent informed of the child’s location. 


One potential future path is to merge the /proc file 
system with the object-oriented interface, allowing 
/proc operations to be performed through the object 
interface. The /proc file system would then be just a 
thin layer allowing access to the objects through the file 
system. This would make the process object into a 
single access path into the process system. It would also 
simplify the kernel code by combining all the process
functionality into one place. Currently, the Solaris /proc
code is entirely separate from the virtual process code.


VP_FORK(flag, pid, node)
VP_EXIT(flag, status)
VP_EXEC(fname, argp, envp, node)
VP_VHANGUP()
VP_WAIT(idtype, id, ip, options)
VP_SIGSENDSET(psp, sig, local)
VP_SETSID(flag, rval)
VP_PRIOCNTLSET(version, psp, cmd, arg, rval, local)
VP_GETSID(pid, pgid, sid)
VP_SIGNAL(pid, sigsend)
VP_SETPGID(pid, pgid)
VP_SIGCLD(cp)
VP_SIGCLDMODE(mode)
VP_FORKDONE(p, cp, mig)

Table 1: Procedural calls into the virtual process layer. These calls are just a thin layer between the C code of the
Solaris kernel and the C++ code of the Solaris MC module.


4 Details of process management


This section describes the main process functions in 
more detail. Sections 4.1, 4.2, 4.3, and 4.4 describe how 
we implemented signals, process groups, remote execu- 
tion, and cross-node waits, respectively. Section 4.5 
discusses what failure recovery will be provided. 
Section 4.6 explains how a global /proc file system 
was added to the Solaris MC file system. Finally, 
Section 4.7 discusses our experience with object- 
oriented programming in the kernel. 


4.1 Signals 


Several issues complicate signal delivery in a distributed 
system. Delivery of a signal to a remote process is 
straightforward; an object reference to the process is 
obtained from the pid, the signal method is invoked on 
that process object, and the object delivers the signal 
locally. 


The more complicated cases are a kill to a process 
group, the sigsend system call, and the sigsendset 
system call. To signal a process group, the appropriate 
process group object is located, the signal method is 
invoked on this object, the process group object invokes 
the member process objects, and these objects deliver the
signals. Locking complicates this operation. 


The sigsend and sigsendset system calls allow 
signals to be sent to complex sets of processes, formed 
by selecting sets of processes based on pid, group id, 
session id, user id, or scheduler class, and then 
combining these sets with boolean operations. To handle 
these system calls, the operation is sent to each node in 
the system and each node signals the local processes 


jaddvpid(pid, vproc, update) 

registerloadmgr(mgr) 

sigsendset (psp, sig, local, srcpid, cred, 
srcsession) 


priocntlset(version, psp, amd, arg, rvp, 
local) 

getprocmgr (nodenumber ) 

addprocmgr(nodenumber, pm) 


Table 2: Methods on the node object 


signal(srcpid, cred, srcsession, signum) 
setpgid(srcpid, pgid) 


getsid(srcpid, pgid, sid) 
getpid() 
releasechild(mode, noparent) 


Table 3: Methods on the virtual process object 


1997 Annual Technical Conference 


matching the set. This can be inefficient since all nodes 
must handle the operation; if performance becomes a 
problem, we will add optimizations for the simple cases 
(such as the set specifies a single process or process 
group) so they can be handled efficiently. 


4.2 Process groups 


POSIX process groups and sessions support job control: 
a tty or pseudo-tty corresponds to a session, and each 
job corresponds to a process group. The process 
Management subsystem must keep track of which 
processes belong to which process groups, and which 
process groups belong to which sessions. Process group 
membership is used by the I/O subsystem to direct /O 
only to processes in a foreground process group, and 
send TTIN/TTOU signals to background processes. 


A POSIX-compliant system must also detect orphaned 
process groups (which are unrelated to orphaned 
processes). A process group is not orphaned as long as 
there is a process in the group with a parent outside the 
process group but inside the session. (The two prototyp- 
ical orphaned process groups are the shell and a job that 
lost its controlling shell.) Orphaned process groups are 
not permitted to receive terminal-related stop signals, as 
there is no controlling process to return them to the fore- 
ground. In addition, if a process group becomes 
orphaned and contains a stopped process, the process 
group must receive SIGHUP and SIGCONT signals to 
prevent the processes from remaining stopped forever. 


To support these POSIX semantics, the Solaris MC OS 
uses an object for each process group and an object for 
each session, with the session object referencing the 


rfork(state, astate, flags) 


rexec(fname, argp, envp, state, flags, 
newvproc) 


lockpid(pid) 

unlockpid(pid) 
findvproc(pid, local, vproc) 
freevpid(pid) 





childmnigrated(childpid, newchild) 
parentgroupchange(par_pgid, par_sid) 
childstatuschange(childpid, wcode, wdata, 

utime, stime, zombie, noparent) 
setsid(sess ) 
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member process group objects and each process group 
object referencing the member processes. For efficiency, 
we originally planned to include an additional object, 
the local process group object, which would reside on 
each node that had a process in the process group and 
would reference the local processes in the process 
group. The local process group object would be more 
efficient for signalling process groups split across 
multiple nodes and for determining orphaned process 
groups, since most messages would be intra-node, to the 
local process group object, rather than inter-node, to the 
top-level process group object. The added complexity of 
this approach, however, convinced us to eliminate the 
local process group objects. 


The process group objects detect orphaned process 
groups by keeping a reference count of the number of 
controlling links to the process group (i.e. processes 
with a parent outside the process group but inside the 
session). Operations that create or destroy a link 
(changing process group membership or exiting of the 
parent or child) inform the process group object; if the 
count drops to zero, the process group is signalled as 
required. 
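The reference-counting scheme can be illustrated with a short sketch. The class and method names below are hypothetical, not the Solaris MC interfaces, and signals are recorded by name rather than delivered. The sketch assumes the caller tells the object whether the group currently contains a stopped process.

```python
class ProcessGroupObject:
    """Counts controlling links: members whose parent is outside the
    process group but inside the session."""

    def __init__(self, pgid):
        self.pgid = pgid
        self.controlling_links = 0
        self.has_stopped_member = False
        self.signals_sent = []

    def link_created(self):
        # Called when a membership change creates a controlling link.
        self.controlling_links += 1

    def link_destroyed(self):
        # Called when a membership change or an exit removes a link.
        assert self.controlling_links > 0, "no link to destroy"
        self.controlling_links -= 1
        if self.controlling_links == 0 and self.has_stopped_member:
            # Newly orphaned with a stopped process: wake the group up.
            self.signals_sent += ["SIGHUP", "SIGCONT"]

    @property
    def orphaned(self):
        return self.controlling_links == 0


job = ProcessGroupObject(200)
job.link_created()
job.link_created()
job.has_stopped_member = True
job.link_destroyed()
still_controlled = not job.orphaned
job.link_destroyed()
```

After the second link is destroyed, the group is orphaned and the SIGHUP and SIGCONT pair is recorded exactly once.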


4.3 Remote execution


The Solaris MC operating system provides a remote 
execution mechanism to create processes on remote 
nodes. The rexec system call is used for remote execu- 
tion; it is similar to the UNIX exec system call except 
the call specifies a node where the new process image 
runs. The remote process then sees the same environ- 
ment as if it were running locally, due to the global file 
system and other single-system-image features. 


Process migration, which moves an existing process to a 
new node, is not yet implemented in Solaris MC, due to 
the additional complexity of moving the address space. 
The remote exec is much easier to implement and more 
efficient than migrating the address space of a running 
process (which could be partially paged out). We feel 
that most of the load-balancing benefit of remote 
processes can be obtained by positioning the process at 
execution time (as do the Plan 9 researchers [19], 
although Harchol-Balter and Downey [11] have found 
benefits from moving running processes). We plan to 
implement migration of running processes later, in 
particular to off-load processes from a node scheduled 
for maintenance; at that time we will provide rfork and
migration (which are very similar, since an rfork is
basically a fork and a migrate). The implementation of
migration in Solaris MC will be similar to that in other 
distributed operating systems (e.g. Sprite [7], MOSIX 
[3], or Locus [20]), moving the address space to the new 
node. 


The rexec call operates by packaging up the necessary 
state of the process (the list of open files, the proc_t 
data, and the list of children), and sending this to the 
destination node along with the exec arguments. The 
process is then started on the remote node (using a hook 
into low-level process creation, since the existing fork 
code won’t work without a local parent) and the old 
process is eliminated. Finally, the parent receives an 
object reference to the new virtual process object to 
update its child list. Our process management code then 
transparently manages the cross-node parent-child rela- 
tionships, as described in Section 4.4. 


Remote execution takes advantage of the Solaris MC 
distributed file system to handle migration of open files 
across remote execs. With the distributed file system, it 
doesn’t matter what node is performing a file operation. 
For each open file, the offset and the reference to the file 
object are sent across to the new node. These are then 
used to create vnodes for the new process. File opera- 
tions from the new process then operate transparently, 
with the Solaris MC file system ensuring cache consis- 
tency. 


The Solaris MC operating system provides support for 
multiple load balancing policies. A node can be speci- 
fied in the rexec call, or the location can be left up to 
the system. By default, rexec uses round-robin place- 
ment if a node isn’t specified, but hooks are provided for 
an arbitrary policy; the registerloadmgr method 
allows a load management object to be registered with
the node manager. For each remote execution, the node 
manager will then query the load management object for 
the destination node id, which can be generated by any 
desired algorithm. The load manager can use algorithms 
similar to the OSF/1 AD TNC [25] or Sprite [7] load 
daemons. Since the load management object communi- 
cates with the node manager through the Solaris MC 
object model, the load management object can be implemented
either in the kernel or at user level, even on a
different node. This illustrates the flexibility and power 
of our distributed object model. 
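The placement hook can be pictured with the sketch below. Only the name registerloadmgr comes from Table 2; the pick_node method, the class names, and the least-loaded policy are hypothetical and serve only to show a default round-robin policy being replaced by a registered load management object.

```python
import itertools


class RoundRobinPlacement:
    """Default policy: cycle through the node ids in order."""

    def __init__(self, node_ids):
        self._cycle = itertools.cycle(node_ids)

    def pick_node(self):
        return next(self._cycle)


class LeastLoadedPlacement:
    """Example replacement policy: choose the node with the lowest load."""

    def __init__(self, load_by_node):
        self.load_by_node = load_by_node

    def pick_node(self):
        return min(self.load_by_node, key=self.load_by_node.get)


class NodeManager:
    def __init__(self, node_ids):
        self._loadmgr = RoundRobinPlacement(node_ids)

    def registerloadmgr(self, mgr):
        # Any object with a pick_node() method can take over placement.
        self._loadmgr = mgr

    def choose_destination(self, requested_node=None):
        # An explicitly requested node always wins over the policy.
        if requested_node is not None:
            return requested_node
        return self._loadmgr.pick_node()


manager = NodeManager([0, 1, 2])
default_choices = [manager.choose_destination() for _ in range(4)]
explicit_choice = manager.choose_destination(requested_node=2)
manager.registerloadmgr(LeastLoadedPlacement({0: 5, 1: 1, 2: 3}))
policy_choice = manager.choose_destination()
```

Because the node manager only calls pick_node on whatever object was registered, the policy object could equally be a proxy for an object living elsewhere, which is the property the text attributes to the object model.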


4.4 Waits 


The Solaris MC operating system supports the UNIX 
wait semantics, even when the parent and child are on 
different nodes. 


Several approaches are possible for handling waits. 
One model, the “pull” model, has the parent request 
information from the child when it does a wait. If the 
wait does not return immediately, the parent sets up 
callbacks with the children so that it is informed when 
an event of interest occurs. A second model, the “push”
model, has the children inform the parent of every state
change.


The pull model has the advantage that no messages are 
exchanged until the parent actually performs a wait. 
However, setting up and tearing down the callbacks can 
be expensive if the parent has many children, since 
many callbacks will have to be created and removed for 
each wait. 
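A rough message count makes the trade-off between the two models concrete. The cost model below is an assumption made for illustration, not a measurement: a blocking wait under the pull model is charged one callback set-up message and one tear-down message per child plus a single notification, and the push model is charged one message per child state change.

```python
def pull_messages(children, blocking_waits):
    """Messages under the assumed pull cost model."""
    per_wait = children + 1 + children  # set up, one notification, tear down
    return blocking_waits * per_wait


def push_messages(state_changes):
    """Messages under the assumed push cost model."""
    return state_changes


# A parent with 50 children that blocks in a single wait:
many_children_pull = pull_messages(children=50, blocking_waits=1)
# The same 50 children each exiting once under the push model:
many_children_push = push_messages(state_changes=50)
# One child that changes state 1000 times before the parent waits once:
noisy_child_pull = pull_messages(children=1, blocking_waits=1)
noisy_child_push = push_messages(state_changes=1000)
```

Under these assumptions the pull model loses when a parent has many children, and the push model loses when a child changes state often without the parent caring.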


The Solaris MC operating system uses the push model. 
If child processes perform many state changes that the 
parent doesn’t care about, this will be more expensive. 
Normally, however, the process will just exit, requiring 
one message. It would be interesting to compare the
number of messages required for the push vs. pull
models on a real workload.


Note that in Solaris an exiting child process normally
goes into a “zombie” state until the parent performs a 
wait on it. It would have been simpler to do away with 
zombie processes altogether and just keep the child’s 
exit status with the parent, but Solaris semantics require 
a zombie process which will show up on ps, for 
instance. In addition, POSIX requires that the pid isn’t
reused until the wait is done. 


The Solaris MC operating system uses a simple state 
machine to handle waits and avoid race conditions, as 
shown in Figure 4. When the child exits, it informs the
parent (through the childstatuschange method) and 
changes state to the right; when the parent exits or 
waits on the exited child, it informs the child (through 
the releasechild method) and the child changes state 
downwards. When the child reaches the final state, it can 
be freed from the system. 
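The state machine can be written down as a small transition table. The state names follow Figure 4; the event names "exit" and "release" are hypothetical labels for the child exiting (reported through childstatuschange) and for the parent exiting or waiting on the exited child (reported through releasechild).

```python
TRANSITIONS = {
    ("running", "exit"): "zombie",       # child exits: move right
    ("noparent", "exit"): "gone",
    ("running", "release"): "noparent",  # parent exits: move down
    ("zombie", "release"): "gone",       # parent waits or exits: move down
}


def step(state, event):
    """Return the next state, rejecting events that are invalid here."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not valid in state {state!r}"
        ) from None


# The child exits first and the parent then waits on it.
child_first = step(step("running", "exit"), "release")
# The parent exits first and the child exits afterwards.
parent_first = step(step("running", "release"), "exit")
```

Both orderings end in the final state, at which point the process can be freed; this is how a table of this shape avoids depending on which side acts first.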


Thus, when a child exits, the parent’s virtual process 
object keeps track of the state of the child (i.e. exited), 
its exit status, and the user and system time used by the 
child process (for use by times(2)).


4.5 Failure recovery 


The Solaris MC operating system is designed to keep
running in the event of node failures. The semantics for
a node failure are that the processes on the failed node
will die, and to the rest of the system it should look as if
these processes were killed. That is, the normal semantics
for waits should apply, ensuring that zombie
processes don’t result.


    running     zombie

    noparent    gone

Figure 4: The state machine for processes. If a process
exits, its state transitions to the right. If its parent exits,
its state transitions downwards.


In the event of a failure, the system must recover neces- 
sary state information that was on the failed node. One 
type of information is migration pointers. If a process 
did multiple rexecs, it may have forwarding pointers on 
the failed node. After failure, it must be linked up with 
the parent. Also, after failure, process group member- 
ship must be checked to see if an orphaned process 
group resulted. 


To support failure recovery, process management in the 
Solaris MC OS was designed to avoid single points of 
failure. For instance, there is no global pid server (this 
also helps efficiency). A remotely exec’d process can 
lose its home node through a failure; in this case, the 
next live node in sequence will take over as home node, 
providing a way to find the process. 
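The take-over rule can be sketched as a search for the next live node. The sketch assumes that nodes are numbered 0 to node_count - 1 and that "next in sequence" means increasing node number with wrap-around; neither detail is stated in the text, and the function name is hypothetical.

```python
def current_home_node(original_home, live_nodes, node_count):
    """Return the node acting as home node for a process.

    Starts at the original home node and walks forward, wrapping around,
    until a live node is found.
    """
    for offset in range(node_count):
        candidate = (original_home + offset) % node_count
        if candidate in live_nodes:
            return candidate
    raise RuntimeError("no live nodes in the cluster")


# Node 2 has failed in a four-node cluster, so node 3 takes over.
after_failure = current_home_node(2, live_nodes={0, 1, 3}, node_count=4)
# Nodes 2 and 3 have both failed; the search wraps around to node 0.
after_wrap = current_home_node(3, live_nodes={0, 1}, node_count=4)
# A live home node remains the home node.
unchanged = current_home_node(1, live_nodes={0, 1, 2, 3}, node_count=4)
```

Every surviving node can evaluate this rule locally from the set of live nodes, so no central server is needed to locate the process.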


Failure recovery in the Solaris MC operating system is 
still being developed and will be based on low-level 
failure handling in the object layer. The object layer will 
notify clients of failed invocations, will clean up refer- 
ences to the failed node, and will release objects that are 
no longer referenced from a live node. 


4.6 The /proc file system 


The /proc file system is a pseudo file system in Solaris 
[9] that provides access to the process state and image of 
each process running in the system. This file system is 
used principally by ps to print the state of the system 
and by debuggers such as dbx to manipulate the state of 
a process. Each process has an entry in /proc under its 
process id, so the process with pid nnnnn is accessible 
through /proc/nnnnn. The /proc file system supports 
reads and writes, which access the underlying 
memory of the process, and ioctls, which can perform
arbitrary operations on a process such as single step- 
ping, returning state, or signalling. By providing a 
global /proc, the Solaris MC OS supports systemwide 
ps and cross-node debugging. 


The /proc file system raises several implementation 
issues for a distributed process system. First, all 
processes on the system must appear in the /proc 
directory. Second, operations on a /proc entry must 
control the process wherever it resides. Third, ioctls
may require copying arbitrary amounts of data between 
user space on one node and the kernel on another. 
Finally, some /proc ioctls return file descriptors for
files that the process has opened; these file descriptors
must be made meaningful on the destination node.


In Solaris MC, /proc is extended to provide a view of
the processes running on the entire cluster, by merging 
together local /procs into a distributed picture, as 
shown in Figure 5. Thus, each node uses the existing 
/proc implementation to provide the actual /proc 
operations and a merge layer makes these look like a 
global /proc. To implement this, the readdir opera- 
tion was modified to return the contents of all the local 
/procs, merged into a single directory. The new path- 
name lookup operation returns the vnode entry in the 
appropriate local /proc, so operations on the /proc 
entry then automatically go to the right node. If the 
process migrates away, the file system operation gener- 
ates a “migrated” exception internally; the merging 
layer catches this and transparently redirects the opera- 
tions. 


The implementation of the distributed /proc file 
system illustrates the advantages of an object-oriented 
file system implementation. The merging /proc file 
system was implemented as a subclass of the standard 
Solaris MC distributed file system that redefines the 
readdir and lookup operations. The remainder of the 
file system code is inherited unmodified. Thus, the 
object-oriented approach allows new file system seman- 
tics to be implemented while sharing most of the old 
code. 
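The subclassing approach can be illustrated with a toy model. Every class below is a hypothetical stand-in: LocalProc plays the role of one node's existing /proc, BaseFileSystem stands for the inherited file system code, and the Migrated exception mimics the internal "migrated" exception. Only readdir and lookup are overridden, as in the text.

```python
class Migrated(Exception):
    """Raised by a local /proc when the process has moved to another node."""

    def __init__(self, new_node):
        super().__init__(f"process moved to node {new_node}")
        self.new_node = new_node


class LocalProc:
    def __init__(self, node, entries):
        self.node = node
        self.entries = dict(entries)  # pid -> status string
        self.moved = {}               # pid -> node the process moved to

    def readdir(self):
        return sorted(self.entries)

    def lookup(self, pid):
        if pid in self.moved:
            raise Migrated(self.moved[pid])
        return self.entries[pid]


class BaseFileSystem:
    def __init__(self, local_procs):
        self.local_procs = {lp.node: lp for lp in local_procs}

    def readdir(self):
        raise NotImplementedError

    def lookup(self, pid):
        raise NotImplementedError

    def status(self, pid):
        # Inherited unmodified by the subclass; it only relies on lookup().
        return f"{pid}: {self.lookup(pid)}"


class MergedProc(BaseFileSystem):
    def readdir(self):
        # Merge the contents of every local /proc into one listing.
        pids = []
        for lp in self.local_procs.values():
            pids.extend(lp.readdir())
        return sorted(pids)

    def lookup(self, pid):
        # Route to the node holding the process, following a migration.
        for lp in self.local_procs.values():
            if pid in lp.entries or pid in lp.moved:
                try:
                    return lp.lookup(pid)
                except Migrated as moved:
                    return self.local_procs[moved.new_node].lookup(pid)
        raise FileNotFoundError(f"/proc/{pid}")


node0 = LocalProc(0, {100: "running"})
node1 = LocalProc(1, {205: "sleeping"})
proc = MergedProc([node0, node1])
listing = proc.readdir()
```

The inherited status method works on the merged view without modification, which is the code-sharing benefit the paragraph describes.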


The /proc file system uses two techniques for 
handling cross-node ioctls. One solution would be to
modify the /proc implementation code to explicitly 
copy the argument data between the two nodes. This 
would, however, require modifications to the /proc 
source. The solution used in Solaris MC extends the 
technique used by the Solaris MC file system; the low- 
level routines to copy data to and from the kernel 
(copyin, copyout, copyinstr, copyoutstr) were 
modified so that if they are called in the course of an 
ioctl, the data is transferred from the remote node. A 
few ioctls, however, required special handling because
they return the numeric value of a file descriptor
corresponding to an open file. For these ioctls, the open file
is wrapped in a Solaris MC PXFS file system object and
the object reference is transferred to the client node. A
new file descriptor is opened on the client corresponding
to this object. Then, any operations on this file
descriptor will be sent through PXFS back to the
original open file.


4.7 Experience with objects 


Our experience with object-oriented programming in the 
kernel is generally positive. Objects provided a cleaner 
design because of the encapsulation of data structures 
and the enforcement of well-defined interfaces. The 
object framework simplified implementation because of 
the location-independence of object invocations. This 
obviates having to keep track of which processes are 
local and which processes are global. It also makes 
remote procedure calls entirely transparent. 


One major difficulty we found with object-oriented 
programming for the kernel is that the C++ tools aren’t 
as well developed as for C. We encountered several 
compiler and linker difficulties, and templates remain a 
problem. In addition, our C++ debugging environment 
is rather primitive. For instance, we had to modify kadb 
to return demangled C++ names. 


A second issue with object-oriented programming is 
efficiency; our problems generally arose from over- 
enthusiastic use of C++ features. One problem the 
Solaris MC project encountered was that C++ excep- 
tions were costly in our implementation, both when they 
were thrown and even when they weren’t (since the 
additional code polluted the instruction cache). As a 
result, we decided to remove C++ exceptions from the 
Solaris MC implementation and use a simple error 
return mechanism instead. A second problem is that 
object-oriented programming makes it easy to have an
excessive number of classes implementing nested data
abstractions and multiple levels of subclasses to provide
specialization. Unfortunately, this can result in a very
deep call graph, with the associated performance
penalty. We are currently restructuring the implementation
of our transport layer, where performance suffers
due to this problem.


[Figure 5 diagram: on each node, a global /proc layered over the local /proc]

Figure 5: The distributed /proc file system merges together the local /proc file systems so each node has a /proc that
provides a global view of the system.


5 Performance measurements 


Tables 4 and 5 give performance measurements of the
Solaris MC process management implementation.²
These measurements were taken on a cluster of four 
two-processor SPARCstation 10’s, three with 50 MHz 
processors and one with 40 MHz processors, each with 
64 MB and running Solaris 2.6 with Solaris MC modifi- 
cations. The interconnect is a 10 MByte/sec 100-Base T 
using the SunFast interface, and the transport layer uses 
the STREAMS stack. Note that Solaris MC is a research 
prototype and has not been tuned for performance. 
Thus, these numbers should be viewed as very prelimi- 
nary and not a measure of the full potential of the 
system. 


5.1 Micro-benchmarks 


Table 4 gives several performance measurements 
comparing the unmodified Solaris OS, the Solaris MC 
OS performing local operations, and the Solaris MC OS 
performing remote operations. 


The first three sets of measurements are from the Locus
MicroBenchmarks (TNC Performance Test Suite). The
first line shows the time for a process to fork, the child
to exit, and the parent to wait on the child. There is
some slowdown in Solaris MC due to the additional
overhead of going through the globalization layer and
creating the virtual process object. The second line
shows the time taken for a process that does multiple
execs, either on a single node or between nodes. The
Solaris MC overhead from local execs is minimal, since
the virtual process object remains unchanged. There is
additional overhead, however, for a remote exec, due to
the process state (open file descriptors, etc.) that must be
transmitted across the network, and the virtual process
object that must be created on the remote node. The
third line shows performance for a process that performs
a cycle of fork, exec (local or remote), and exit.
Again, Solaris MC has some overhead even for the local
case due to the fork, and much more overhead for the
remote exec. These measurements show that Solaris
MC could use tuning of the virtual process object to
improve fork performance, and remote exec performance
is limited by the inter-node communication cost,
but the globalization layer adds negligible cost to a local
exec.


The next measurements show performance of the
PIOCSTATUS ioctl on the /proc file system, which
returns a 508 byte structure containing process status
information. This measurement illustrates the performance
of the globalized /proc file system and of the
cross-node ioctl data copying. In the local case, there
is a slowdown of about 30 µs due to the overhead of
going through the /proc globalization code and the
PXFS ioctl layer. The remote case is considerably
slower due to the network traffic. As discussed in
Section 4.6, the copyout of ioctl data to user level is
done through modified low-level copy functions. Thus,
two network round-trips are required: one to invoke the
remote ioctl, and a second to send the data back.
Originally, an address-space object was created for every
ioctl to handle data copying, but the overhead of
object creation made even local ioctls take about 1 ms,
an unacceptable overhead. However, by using a
per-node object rather than a per-operation object, and by
using the standard copy functions for local ioctls
rather than the modified ones, the overhead was
substantially reduced.


The final measurements in Table 4 show performance
for signalling between processes. In this test, two
processes are started and send signals back and forth:
the parent process sends a USR1 signal to the child
process, the child process catches the signal and sends a
USR1 signal to the parent, the parent catches the signal,
and the cycle repeats. The time given is the average
one-way time (i.e. half the cycle time). Again there is some
slowdown for the local case in Solaris MC and a
considerable overhead for between-node communication.


Operation                         Solaris   Solaris MC   Solaris MC, remote
fork/exit (TNC #1)
(r)exec (TNC #5, #6)
fork/(r)exec/exit (TNC #7, #8)
PIOCSTATUS ioctl on /proc
Signals

Table 4: Performance of various operations performed on standard Solaris, on Solaris MC locally, and on Solaris MC
across nodes.


Operating system, file system, process location           Sequential        Parallel
Solaris, local UFS file system, local processes           26 sec (33 sec)   18 sec (22 sec)
Solaris, remote NFS file system, local processes          36 sec (38 sec)   23 sec (27 sec)
Solaris MC, local UFS file system, local processes        26 sec (32 sec)   20 sec
Solaris MC, local PXFS, local processes                   27 sec (31 sec)   19 sec
Solaris MC, remote PXFS, local processes                  29 sec (35 sec)   21 sec
Solaris MC, PXFS, processes running throughout cluster    36 sec            15 sec

Table 5: Times for the compilation phase of the modified Andrew benchmark. Numbers in parentheses were
measured on a 40 MHz node. The parallel measurements in the first five rows run multiple processes on the
two processors on a node; the final parallel measurement runs processes on all nodes in the cluster.


2. The programs used for these tests are available from
http://www.sunlabs.com/research/solaris-mc/process.html.


5.2 Parallel make performance 


One interesting measure is how well system perfor- 
mance scales for parallel tasks on multiple nodes. Table 
5 gives performance measurements for the system on 
the compilation phase of the modified Andrew bench- 
mark [17], compiling a sequence of files sequentially 
and in parallel across the cluster under various condi- 
tions. The first two lines show compilation time for a 
single node running Solaris, accessing the files from a 
local disk or NFS, and compiling with sequential make 
or parallel make (which takes advantage of the dual 
processors on a node). Unfortunately, one of the nodes 
had 40 MHz processors, while the others had 50 MHz 
processors, which makes cluster measurements slightly 
harder to interpret; numbers in parentheses show results 
on the slower node. The next three lines show perfor- 
mance of a single node running Solaris MC, using a 
local disk running the UNIX file system, a local disk 
running the Solaris MC proxy file system (PXFS), or a 
remote node running the Solaris MC file system. 
Finally, the last line shows performance of Solaris MC 
when processes are executed around the cluster, either 
sequentially or in parallel. 


Table 5 shows a speedup for parallel compilation on the 
Solaris MC cluster compared to a single Solaris node 
(15 sec vs. 18 sec), but this speedup isn’t as dramatic as 
one would expect when going from 2 processors to 8 
processors. Several factors account for this. First, the 
benchmark compile ends with a link phase that must be 
done sequentially and takes about 5 seconds. Thus, even 
with perfect scaling, the entire compile would take 
about 8 seconds. Second, there is a performance penalty 
in accessing PXFS files from a remote node (although 
comparing the PXFS measurements with the Solaris 
measurements shows that locally, PXFS is comparable 
to the native file system, and remotely it is faster than 
NFS). Thus, the compile is slowed down due to cross-node
file accesses. Third, the node with 40 MHz processors
slows down the processes that are migrated to that
node. Fourth, the parallel make uses a simple round- 
robin allocation policy, which yields non-optimum load 
balancing, especially when there is one slow node. 
Finally, comparing the last two lines of Table 5 shows 
that there is a significant performance penalty when 
processes are executed on remote nodes (increasing the 
time from 29 to 36 seconds for sequential execution). 
The effects of these factors can be reduced, however. 
Most significantly, with a faster interconnect and tuning 
of the object transport layer, the overheads due to 
communication between nodes will be reduced, 
improving file system and remote execution perfor- 
mance. With a faster interconnect and a balanced 
system, the performance improvement from parallel 
execution would be considerably better. 
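The "about 8 seconds" bound quoted above follows from a simple Amdahl-style estimate, reproduced here as a sketch. It assumes that the 18 second parallel build on one two-processor node consists of a 5 second sequential link plus compile work that divides perfectly among processors; the function name is hypothetical.

```python
def ideal_time(measured, serial, procs_measured, procs_target):
    """Best-case run time when only the non-serial part speeds up.

    measured       -- run time observed on procs_measured processors
    serial         -- portion that cannot be parallelized (the link phase)
    procs_measured -- processors used for the measured run
    procs_target   -- processors available in the larger configuration
    """
    parallel_part = measured - serial
    return serial + parallel_part * procs_measured / procs_target


# 18 sec on 2 processors, 5 sec sequential link, scaled to 8 processors.
estimate = ideal_time(measured=18, serial=5, procs_measured=2, procs_target=8)
```

The estimate is 8.25 seconds, so the measured 15 seconds on the cluster is within a factor of two of the best case even before the other factors listed above are taken into account.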


5.3 Performance discussion


Note that the Solaris operating system has been exten- 
sively tuned for performance, while Solaris MC is an 
almost entirely untuned research system. We expect 
these numbers to improve significantly with tuning, but 
there will still be some performance penalty due to the 
longer code path through the globalization layer. 


Remote process operations are considerably slower than 
local operations because of the additional network trans- 
port time. The performance of distributed process 
management depends strongly on the performance of 
the underlying object system and the cluster network. 
We are in the process of optimizing the Solaris MC 
object framework to provide faster inter-node communi- 
cation. 


6 Related work


Several research and commercial operating systems 
provide distributed process management, such as Unisys 
Opus (Chorus) [4], Intel XP/S MP Paragon, OSF/1 AD
TNC (Mach) [25], DCE TCF (Locus) [20], Sprite [7],
GLUnix [22], and MOSIX [3]. Solaris MC uses many of
the concepts from these systems. 


Process management in the Solaris MC operating 
system differs from previous systems in several ways. 
First, many of the previous systems build distributed 
process management from scratch; the Solaris MC OS 
demonstrates how distributed process management can 
be added to an existing commercial kernel while mini- 
mizing kernel changes. On the other hand, the Solaris 
MC OS provides a stronger single-system image than 
systems such as GLUnix, which build a globalization 
layer at user level on top of an existing kernel. 


The Solaris MC OS also differs from previous systems 
in that it is built on an object-oriented communication 
framework, rather than an RPC-based framework. One
key difference is that the object-oriented framework 
transparently routes invocations to the local or remote 
node as necessary, compared to RPC-based systems 
which require explicit marshalling of arguments and 
calling to a particular node. In addition, the object- 
oriented framework provides a built-in object reference 
counting mechanism; this avoids ad hoc mechanisms 
that are typically required in an RPC-based system to 
clean up state after node failures. The Solaris MC OS 
also illustrates how object-oriented programming can be 
added to an existing monolithic kernel. 


The Solaris MC OS also presents new object-oriented 
interfaces to the process subsystem. These interfaces 
may be useful to applications that require more control 
over the process subsystem. 


Finally, the Solaris MC OS shows how the /proc file 
system can be extended to a cluster to provide file access 
to process state throughout a cluster. Unlike the 
VPROCS [24] implementations of /proc, Solaris MC 
uses object inheritance to provide a distributed /proc 
through subclassing of the PXFS file system implemen- 
tation. 


7 Conclusions 


Process management in the Solaris MC research oper- 
ating system provides a distributed view of processes 
with a single pid space across a cluster of machines, 
while preserving POSIX semantics. It supports the stan- 
dard UNIX process operations and the Solaris /proc 
file system as well as providing remote execution. 


Process management is implemented in an object frame- 
work, with objects corresponding to each process, 
process group, and system node. This object framework 
simplified implementation of process management by 
providing transparent communication between nodes, 
failure notification, and reference counting.


Unlike the file system with the vnode interface, process
management in UNIX generally lacks an interface for 
extending the system. While the /proc interface 
provides some control over processes, it is limited to 
status and process control operations. Process manage- 
ment in the Solaris MC operating system extends the 
access to the operating system’s process internals by 
providing an object-oriented interface. In the future, the 
/proc interface and the object-oriented interface could 
be merged to provide a single extensible interface to the 
operating system’s process management. 


Process management was designed to allow most local 
operations to take place without network communica- 
tion; there is no central server that must be contacted for 
process operations. There is some performance penalty 
due to the overhead of the globalization layer and due to 
creation of the virtual process object. With tuning, 
however, we expect this overhead to be reduced. Remote 
operations suffer a performance penalty due to the inter- 
connect bandwidth and the overhead of the Solaris MC 
transport layer. 


The main components in Solaris MC process manage- 
ment remaining to be implemented are process migra- 
tion and load balancing, although we have remote 
process execution. Support for failure recovery and for
full process group semantics also needs to be implemented.


In conclusion, process management in the Solaris MC 
operating system illustrates how an existing monolithic 
operating system can be extended, with relatively few 
kernel changes, to provide transparent clusterwide 
process management. It also provides a set of object- 
oriented interfaces that can be used to provide access to 
the process internals. 


Acknowledgments 


This work would not have been possible without the 
work of the entire Solaris MC team in building Solaris 
MC. The author is grateful to Scott Wilson, Yousef 
Khalidi, the anonymous referees and especially the 
paper “shepherd” Clem Cole for their helpful comments 
on the paper. 


References 
[1] AT&T, Research Unix Version 8, Murray Hill, NJ. 


[2] M. J. Bach, “Distributed file systems”, The Design of
the UNIX Operating System, Prentice-Hall, Englewood
Cliffs, NJ, 1986.


[3] A. Barak, S. Guday, and R. G. Wheeler, “The
MOSIX Distributed Operating Systems,” Lecture Notes
in Computer Science 672, Springer-Verlag, Berlin,
1993.


[4] N. Batlivala, et al., “Experience with SVR4 Over
CHORUS,” Proceedings of USENIX Workshop on Micro-kernels
& Other Kernel Architectures, April 1992.


[5] J. Bernabeu, V. Matena, and Y. Khalidi, “Extending
a Traditional OS Using Object-Oriented Techniques”,
2nd Conference on Object-Oriented Technologies and
Systems (COOTS), June 1996.


[6] C. T. Cole, P. B. Flinn, and A. Atlas, “An Implemen- 
tation of an Extended File System for UNIX,” Proceed- 
ings of Summer USENIX 1985, pp. 131-144. 


[7] F. Douglis and J. Ousterhout, “Transparent Process
Migration: Design Alternatives and the Sprite
Implementation,” Software: Practice & Experience, vol.
21(8), August 1991.


[8] Fred Douglis, John K. Ousterhout, M. Frans
Kaashoek, and Andrew S. Tanenbaum, “A Comparison
of Two Distributed Systems: Amoeba and Sprite,”
Computing Systems, 4(4):353-384, Fall 1991.


[9] R. Faulkner and R. Gomes, “The Process File
System and Process Model in UNIX System V,” Proceedings
of Winter Usenix 1991.


[10] G. Hamilton, M. L. Powell, and J. G. Mitchell, 
“Subcontract: A Flexible Base for Distributed Program- 
ming,” Symposium on Operating System Principles, 
1993, pp. 69-79. 


[11] M. Harchol-Balter and A. B. Downey, “Exploiting
Process Lifetime Distributions for Dynamic Load Bal- 
ancing,” Proceedings of ACM Sigmetrics '96 Conference 
on Measurement and Modeling of Computer Systems, 
May 23-26 1996, pp 13-24. 


[12] J. Howard, M. Kazar, S. Menees, D. Nichols, M. 
Satyanarayanan, R. Sidebotham, and M. West, “Scale 
and Performance in a Distributed File System,” ACM
Transactions on Computer Systems, 6(1), February 
1988, pp. 51-81. 


[13] IEEE, IEEE Standard for Information Technology
Portable Operating System Interface, IEEE Std
1003.1b-1993, April 1994.


[14] Y. Khalidi, J. Bernabeu, V. Matena, K. Shirriff, and M.
Thadani, “Solaris MC: A Multicomputer Operating System,”
Proceedings of USENIX 1996, January 1996, pp.
191-203.


[15] Steven R. Kleiman, “Vnodes: An Architecture for 
Multiple File System Types in Sun UNIX,” Proceedings
of Summer USENIX Conference 1986, pp. 238-247. 




[16] Object Management Group, The Common Object 
Request Broker: Architecture and Specification, Revi- 
sion 1.2, December 1993. 


[17] J. Ousterhout, “Why Aren’t Operating Systems
Getting Faster As Fast As Hardware?” Proceedings of
Summer USENIX 1990, pp. 247-256.


[18] J. Ousterhout, A. Cherenson, F. Douglis, M. Nelson,
and B. Welch, “The Sprite Network Operating System,”
IEEE Computer, February 1988.


[19] R. Pike, D. Presotto, S. Dorward, B. Flandrena, K.
Thompson, H. Trickey, and P. Winterbottom, “Plan 9
From Bell Labs,” Computing Systems, Vol. 8, No. 3,
Summer 1995, pp. 221-254.


[20] G. Popek and B. Walker, The LOCUS Distributed
System Architecture, MIT Press, 1985. 


[21] Sun Microsystems, Inc., IDL Programmer’s Guide,
1992. 


[22] Amin M. Vahdat, Douglas P. Ghormley, and Thomas
E. Anderson, Efficient, Portable, and Robust Extension
of Operating System Functionality, UC Berkeley
Technical Report CS-94-842, December, 1994. 


[23] B. Walker, J. Lilienkamp, J. Hopfield, R. Zajcew,
G. Thiel, R. Mathews, J. Mott, and F. Lawlor, “Extend- 
ing DCE to Transparent Processing Clusters,” UniForum 
1992 Conference Proceedings, pp 189-199. 


[24] B. Walker, R. Zajcew, G. Thiel, VPROCS: A Virtual
Process Interface for POSIX systems, Technical Report 
LA-0920, Locus Computing Corporation, May, 1992. 


[25] Roman Zajcew, et al., “An OSF/1 UNIX for Massively
Parallel Multicomputers,” Proceedings of Winter
USENIX Conference 1993. 


[26] R. Sandberg, D. Goldberg, D. Walsh, and B. Lyon, 
“The Design of the Sun Network File System,” Proceed- 
ings of Summer USENIX 1985, pp 119-131. 







Adaptive and Reliable Parallel Computing 
on Networks of Workstations 


Robert D. Blumofe 
Department of Computer Sciences 
The University of Texas at Austin 
Austin, Texas 78712 
rdb@cs.utexas.edu 


Philip A. Lisiecki 
MIT Laboratory for Computer Science 
545 Technology Square 
Cambridge, Massachusetts 02139 
lisiecki@mit.edu 


October 21, 1996 


Abstract 


In this paper, we present the design of Cilk-NOW, a 
runtime system that adaptively and reliably executes 
functional Cilk programs in parallel on a network of 
UNIX workstations. Cilk (pronounced “silk”) is a parallel
multithreaded extension of the C language, and all
Cilk runtime systems employ a provably efficient thread-scheduling
algorithm. Cilk-NOW is such a runtime system,
and in addition, Cilk-NOW automatically delivers
adaptive and reliable execution for a functional subset 
of Cilk programs. By adaptive execution, we mean that 
each Cilk program dynamically utilizes a changing set of 
otherwise-idle workstations. By reliable execution, we 
mean that the Cilk-NOW system as a whole and each ex- 
ecuting Cilk program are able to tolerate machine and 
network faults. Cilk-NOW provides these features while 
programs remain fault oblivious, meaning that Cilk pro- 
grammers need not code for fault tolerance. Through- 
out this paper, we focus on end-to-end design decisions, 
and we show how these decisions allow the design to ex- 
ploit high-level algorithmic properties of the Cilk pro- 
gramming model in order to simplify and streamline the 
implementation. 


1 Introduction 


A strong case has been made for using networks of workstations
(NOWs) as parallel-computation platforms [3], and Cilk-NOW [6] is a
software system that has been designed and implemented to run
parallel programs easily and efficiently on networks of UNIX
workstations. Implemented entirely in user-level software on top of
UNIX, Cilk-NOW is a runtime system for a functional subset of the
parallel Cilk language [6, 8, 26], a multithreaded extension of C.
Applications written in Cilk include graphics rendering, backtrack
search, protein folding [37], and the *Socrates chess program [25],
which won second prize at the 1995 ICCA World Computer Chess
Championship running on the 1824-node Intel Paragon at Sandia
National Labs. Like all runtime systems for Cilk, Cilk-NOW schedules
threads using a provably efficient algorithm based on the technique
of random “work stealing” [6, 9], in which processors with no
threads steal threads from victims chosen at random. With this
algorithm, Cilk delivers performance that is guaranteed to be both
efficient and predictable [6, 8]. In addition to thread scheduling,
Cilk-NOW also performs macroscheduling [30]. That is, Cilk-NOW
automatically identifies idle workstations and assigns those idle
workstations to help out with running Cilk programs.

This research was supported in part by the Advanced Research
Projects Agency (ARPA) under Grants N00014-94-1-0985 and [...]
ARPA High-Performance Computing Graduate Fellowship.

The Cilk-NOW runtime system is designed to execute Cilk programs
efficiently in the highly dynamic environment of a NOW. Figure 1(a)
plots the number of machines that were idle¹ at each point in time
over the course of a typical week for a network of 50 SPARCstations
at the MIT Laboratory for Computer Science. As can be seen from this
plot, though more machines are idle at night, a significant number
of machines are idle at various times throughout the day. Therefore,
by adaptively using idle machines both day and night, we can take
advantage of significantly more machine resources than if we run our
parallel jobs as batch jobs during the night. Figure 1(b) is a
histogram giving the total idle processor-hours from this
experiment, broken down by idle time-interval. This histogram shows
that a significant percentage of idle time (1104 processor-hours, or
19.1% of the total 5776 processor-hours) comes from machines that
are idle for less than 30 minutes at a time. Thus, the efficient
exploitation of idle machines requires that machines are able to
join and leave a computation quickly and without human intervention.
These observations are consistent with those of others [5, 20, 27,
28, 31].

¹For this experiment, a machine is idle if the keyboard and mouse
have not been touched for 15 minutes and the 1, 5, and 15 minute
processor load averages are below 0.35, 0.30, and 0.25 respectively.
These load-average thresholds are reasonable but also somewhat
arbitrary.

Figure 1: (a) This plot shows the number of machines, out of the 50
machines in our network, that were idle at each point in time over
the course of one typical week in March, 1995. (b) This histogram
shows the number of idle processor-hours broken down by idle
time-interval. When a machine remains idle for a period of t hours,
it contributes t hours to the height of the bar plotted at position
t rounded up to the nearest 10 minutes.
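
The idleness test used in this experiment (no keyboard or mouse input for 15 minutes, and load averages below fixed thresholds) can be written as a small predicate. The sketch below is illustrative only: the function name, its arguments, and the constant names are assumptions and are not taken from Cilk-NOW. It also rechecks the 19.1% figure quoted above.

```python
# Hypothetical helper; names and signature are illustrative, not Cilk-NOW's.
IDLE_INPUT_SECONDS = 15 * 60           # keyboard and mouse untouched for 15 minutes
LOAD_THRESHOLDS = (0.35, 0.30, 0.25)   # limits for the 1-, 5-, and 15-minute load averages


def is_idle(seconds_since_input, load_averages):
    """Return True if the machine passes the idleness test described in the text."""
    if seconds_since_input < IDLE_INPUT_SECONDS:
        return False
    return all(load < limit for load, limit in zip(load_averages, LOAD_THRESHOLDS))


# Share of idle time contributed by idle periods shorter than 30 minutes.
short_idle_share = 1104 / 5776
print(f"{short_idle_share:.1%}")  # prints 19.1%
```

On a UNIX host, Python's `os.getloadavg()` returns the three load averages in the order this predicate expects; how the time since the last keyboard or mouse event is obtained is platform specific and is left out here.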

Cilk-NOW provides the following features for running 
Cilk programs on a network of workstations. 


Ease of use. A user can run a Cilk program in parallel 
on a NOW as if the program were only being run 
on the local workstation. The user simply types the 
program’s command line, and then the Cilk-NOW 
runtime system automatically schedules the execu- 
tion of the program in parallel across the network. 


Adaptive parallelism. The Cilk-NOW system adap- 
tively executes Cilk programs on a dynamically 
changing set of otherwise-idle workstations [6, 10]. 
When a given workstation is not being used by its 
owner, the workstation automatically joins in and 
helps out with the execution of a Cilk program. 
When the owner returns to work, the machine au- 
tomatically retreats from the Cilk program. 


Fault tolerance. The Cilk-NOW runtime system auto- 
matically performs checkpointing, detects failures, 
and performs recovery [6] while Cilk programs 
themselves remain fault oblivious. That is, Cilk- 
NOW provides fault tolerance without requiring 
that programmers code for fault tolerance. 


Flexibility. The Cilk-NOW system allows the conditions
that are used to determine the idleness of workstations
to be set dynamically, in accordance with the tastes of
the users and the owners of the machines whose cycles
are being stolen. This flexibility preserves the
sovereignty of each workstation’s owner, which is
essential to ensure that owners are willing to
contribute their workstations for use by others.


Security. The Cilk-NOW system uses secure protocols 
that do not open a workstation to unauthorized users 
running foreign code on a machine. The desired de- 
gree of security is that which a given system uses to 
authenticate its remote execution protocol. 


Guaranteed performance. The Cilk-NOW system executes
Cilk programs using a work-stealing scheduler. This
scheduler delivers performance that can be predicted
accurately with a simple abstract model [6, 8]. Moreover,
this simple model can be adapted to the case of
heterogeneous processors and networks [32].


Recently, we ran a Cilk protein-folding application 
pfold [37] using Cilk-NOW on a network of about 
50 Sun SPARCstations connected by shared 10-Mb/s 
Ethernet to solve a large-scale protein-folding problem. 
The program ran for 9 days, surviving several machine 
crashes and reboots, utilizing 6566 processor-hours of 
otherwise-idle cycles, with no administrative effort on
our part (besides typing pfold at the command line to
begin execution), while other users of the network went 
about their business unaware of the program’s presence. 

It is important to note that Cilk-NOW provides these 
features only for Cilk-2 programs which are essentially 
functional. Cilk-NOW does not support more recent ver- 
sions of Cilk (Cilk-3 and Cilk-4) that incorporate virtual 
shared memory, and in particular, Cilk-NOW does not 
provide any kind of distributed shared memory. In addi- 
tion, Cilk-NOW does not provide fault tolerance for its 
I/O facility.

In this paper, we present the design of Cilk-NOW, focusing on those
features of Cilk-NOW that are particular to the NOW environment. The
Cilk-2 language, work-stealing scheduler, MPP implementation, and
guaranteed performance model have been covered at length in other
papers [6, 8, 9, 26]. Here, we focus on adaptive parallelism and
fault tolerance. Specifically, we will show how Cilk-NOW’s
end-to-end design [38] leverages algorithmic properties of the Cilk
programming model and work-stealing scheduler in order to amortize
all the overhead of adaptive parallelism and fault tolerance against
the analytically and empirically bounded overhead of Cilk’s
work-stealing scheduler.

The remainder of this paper is organized as follows. 
In Section 2 we review the Cilk-2 language and work- 
stealing scheduler as first introduced in [8]. In Section 3 
we describe the architecture of a Cilk job executing un- 
der the Cilk-NOW runtime system. Then, in Section 4 
we explain how Cilk-NOW implements adaptive paral- 
lelism, and in Section 5 we explain how Cilk-NOW per- 
forms checkpointing, fault detection, and fault recovery. 
In Section 6 we describe the Cilk-NOW macrosched- 
uling system architecture. In Section 7 we compare the 
Cilk-NOW system to related work. Finally, in Section 8 
we outline plans for future work, and we conclude. 


2 The Cilk language and work-stealing scheduler


In this section we overview the Cilk parallel mul- 
tithreaded language and its runtime system’s work- 
stealing scheduler [6, 8, 26]. For brevity, we shall not 
present the entire Cilk language, and we shall omit some 
details of the work-stealing algorithm. Since Cilk-2 
forms the basis for the Cilk-NOW system, we shall fo- 
cus on the Cilk-2 language and on the Cilk-2 runtime 
system as implemented without adaptive parallelism or 
fault tolerance. 

A Cilk program contains one or more Cilk procedures, and each Cilk
procedure contains one or more Cilk threads. A Cilk procedure is the
parallel equivalent of a C function, and a Cilk thread is a
nonsuspending piece of a procedure. The Cilk runtime system
manipulates and schedules the threads. The runtime system is not
aware of the grouping of threads into procedures. Cilk procedures
are purely an abstraction supported by the cilk2c type-checking
preprocessor [33].

Consider a program that uses double recursion to compute
the Fibonacci function. The Fibonacci function
fib(n) for n ≥ 0 is defined as

$$\mathrm{fib}(n) = \begin{cases} n & \text{if } n < 2; \\ \mathrm{fib}(n-1) + \mathrm{fib}(n-2) & \text{otherwise.} \end{cases}$$


    thread Fib (cont int k, int n)
    { if (n<2)
        send_argument (k, n);
      else
      { cont int x, y;
        spawn_next Sum (k, ?x, ?y);
        spawn Fib (x, n-1);
        spawn Fib (y, n-2);
      }
    }

    thread Sum (cont int k, int x, int y)
    { send_argument (k, x+y);
    }

Figure 2: A Cilk procedure to compute the nth Fibonacci
number. This procedure contains two threads, Fib and Sum.





Figure 2 shows how this function is written as a Cilk pro- 
cedure consisting of two Cilk threads: Fib and Sum. 
While double recursion is a terrible way to compute 
Fibonacci numbers, this toy example does illustrate a 
common pattern occurring in divide-and-conquer appli- 
cations: recursive calls solve smaller subcases and then 
the partial results are merged to produce the final result. 

A Cilk thread generates parallelism at runtime by 
spawning a child thread that is the initial thread of a 
child procedure. A spawn is the parallel equivalent of 
a function call. A spawn differs from a call in that when 
a thread spawns a child, the parent and child may execute
concurrently. After spawning one or more children, the 
parent thread cannot then wait for its children to return— 
in Cilk, threads never suspend. Rather, the parent thread 
must additionally spawn a successor thread to wait for 
the values “returned” from the children. The spawned 
successor is part of the same procedure as its predeces- 
sor. The child procedures return values to the parent pro- 
cedure by sending those values to the parent’s waiting 
successor. Thus, a thread may wait to begin executing, 
but once it begins executing, it cannot suspend. This 
style of interaction among threads is called continuation- 
passing style [4]. Spawning successor and child threads 
is done with the spawn_next and spawn keywords re- 
spectively. Sending a value to a waiting thread is done 
with the send_argument statement. The Cilk runtime 
system implements these primitives using two basic data 
structures: closures and continuations. 

Closures are data structures employed by the runtime 
system to keep track of and schedule the execution of 
spawned threads. Whenever a thread is spawned, the run- 
time system allocates a closure for it from a simple heap. 
A closure consists of a pointer to the code for that thread, 
a slot for each of the thread’s specified arguments, and a 
join counter indicating the number of missing arguments 
that need to be supplied before the thread is ready to run. 
The closure, or equivalently the spawned thread, is ready
if it has obtained all of its arguments, and it is waiting if
some arguments are missing. To run a ready closure, the
Cilk scheduler invokes the thread using the values in the
closure as arguments. When the thread dies, the closure
is freed.

Figure 3: The Fib thread spawns a successor and two children. For
the successor, it creates a closure with 2 empty argument slots, and
for each child, it creates a closure with a continuation referring
to one of these empty slots. The background shading denotes Cilk
procedures.

A continuation is a global reference to an empty argu- 
ment slot of a closure, implemented as a compound data 
structure containing a pointer to a closure and an offset 
that designates one of the closure’s argument slots. Con- 
tinuations are typed with the C data type of the slot in the 
closure. In the Cilk language, continuations are declared 
by the type modifier keyword cont. For example, the 
Fib thread declares two integer continuations, x and y. 

Using the spawn_next primitive, a thread spawns a
successor thread by creating a closure for the successor. 
The successor thread is part of the same procedure as its 
predecessor. For example, in the Fib thread, the state- 
ment spawn_next Sum (k, ?x, ?y) allocates a 
closure with Sum as the thread and three argument slots, 
as illustrated in Figure 3. The first slot is initialized with 
the continuation k and the last two slots are empty. The 
continuation variables x and y are initialized to refer to 
these two empty slots, and the join counter is set to 2. 
This closure is waiting. 

Similarly, using the spawn primitive, a thread spawns a child
thread by creating a closure for the child. The child thread is the
initial thread of a newly spawned child procedure. The spawn
statement is semantically identical to spawn_next. For example, the
Fib thread spawns two children as shown in Figure 3. The statement
spawn Fib (x, n-1) allocates a closure with Fib as the thread and
two argument slots. The first slot is initialized with the
continuation x which, as a consequence of the previous statement,
refers to a slot in its parent’s successor closure. The second slot
is initialized with the value of n-1. The join counter is set to
zero, so the thread is ready.

An executing thread sends a value to a waiting thread 
by placing the value into an argument slot of the wait- 
ing thread’s closure. The send_argument statement 
sends a value to the empty argument slot of a waiting 
closure specified by its argument. The types of the con- 
tinuation and the value must be compatible. The join 
counter of the waiting closure is decremented, and if 
it becomes zero, then the closure is ready. For example,
the statement send_argument (k, n) in Fib writes the value
of n into an empty argument slot in the parent procedure’s
waiting Sum closure and decrements its join counter. When
the Sum closure’s join counter reaches zero, it is ready.
When the Sum thread gets executed, it adds its two
arguments, x and y, and then uses send_argument to “return”
this result up to its parent procedure’s waiting Sum thread.
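
The interplay of closures, continuations, and join counters can be sketched in a few lines of Python. This is a single-processor toy model written for illustration, not the Cilk runtime: the class and function names (`Closure`, `Continuation`, `spawn`, `run_fib`, and so on) are assumptions, and since the text describes spawn and spawn_next as semantically identical, one `spawn` function stands in for both.

```python
from collections import deque, namedtuple

EMPTY = object()  # marks an argument slot that has not been filled yet

# A continuation names one argument slot of one closure.
Continuation = namedtuple("Continuation", ["closure", "offset"])


class Closure:
    """A spawned thread: its code, its argument slots, and a join counter."""

    def __init__(self, thread, args):
        self.thread = thread
        self.args = list(args)
        self.join_counter = sum(1 for arg in self.args if arg is EMPTY)


ready = deque()  # ready deque of one simulated processor; the head is the left end


def spawn(thread, *args):
    """Allocate a closure; it is ready at once if no argument is missing."""
    closure = Closure(thread, args)
    if closure.join_counter == 0:
        ready.appendleft(closure)
    return closure


def send_argument(k, value):
    """Fill the slot named by continuation k and decrement the join counter."""
    k.closure.args[k.offset] = value
    k.closure.join_counter -= 1
    if k.closure.join_counter == 0:
        ready.appendleft(k.closure)


def fib_thread(k, n):
    if n < 2:
        send_argument(k, n)
    else:
        successor = spawn(sum_thread, k, EMPTY, EMPTY)  # spawn_next Sum (k, ?x, ?y)
        x, y = Continuation(successor, 1), Continuation(successor, 2)
        spawn(fib_thread, x, n - 1)
        spawn(fib_thread, y, n - 2)


def sum_thread(k, x, y):
    send_argument(k, x + y)


def run_fib(n):
    """Run closures until the ready deque drains, then return fib(n)."""
    result = []
    root = spawn(result.append, EMPTY)  # waiting closure that receives the answer
    spawn(fib_thread, Continuation(root, 0), n)
    while ready:
        closure = ready.popleft()
        closure.thread(*closure.args)  # a thread runs to completion; it never suspends
    return result[0]


print(run_fib(10))  # prints 55
```

Note that no thread ever blocks: the `sum_thread` closure simply stays off the ready deque until both of its empty slots have been filled.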

At runtime, each processor maintains a “ready” deque 
(double-ended queue) which contains all of the ready 
closures. Whenever a closure is created, if its join 
counter is 0, then it is placed on the head of the ready 
deque. Whenever a send_argument call is made, the 
join counter is decremented, and if the join counter is 
decremented to zero, then the closure is placed on the 
head of the ready deque. When a thread finishes, the next 
thread to execute is chosen from the head of the ready 
deque. 

If no threads are available in the ready deque, a proces- 
sor engages in work stealing. To steal work, a processor, 
called the thief, chooses another processor, called the vic- 
tim, at random and requests a closure to be sent back. If 
that processor has any closures in its ready deque, one 
is removed from the tail of the victim’s ready deque and 
sent across the network to the thief, who will add this
closure to its own ready deque. The thief may then begin
work on the stolen closure. If the victim has no ready 
closures, it informs the thief who then tries to steal from 
another random processor until a ready closure is found 
or program execution completes. 
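
The deque discipline described above (owners work at the head, thieves take from the tail of a random victim) can be tried out with a small lock-step simulation. Everything here is an assumption made for illustration: the task model (a task of size s splits into two halves until it reaches size 1), the round structure, and the function name `simulate`. It is not a model of Cilk-NOW's network protocol.

```python
import random
from collections import deque


def simulate(num_workers, root_size, seed=0):
    """Simulate random work stealing; return (rounds, leaves_done, steals)."""
    rng = random.Random(seed)
    deques = [deque() for _ in range(num_workers)]
    deques[0].appendleft(root_size)  # worker 0 starts with the root task
    rounds = leaves = steals = 0
    while any(deques):
        rounds += 1
        busy = [w for w in range(num_workers) if deques[w]]
        idle = [w for w in range(num_workers) if not deques[w]]
        for me in busy:
            size = deques[me].popleft()  # the owner takes work from the head
            if size == 1:
                leaves += 1
            else:
                deques[me].appendleft(size - size // 2)
                deques[me].appendleft(size // 2)
        for thief in idle:
            victim = rng.choice([w for w in range(num_workers) if w != thief])
            if deques[victim]:
                deques[thief].appendleft(deques[victim].pop())  # steal from the tail
                steals += 1
    return rounds, leaves, steals


print(simulate(1, 64))  # prints (127, 64, 0)
```

With one worker, all 127 tasks (63 splits plus 64 leaves) run one per round and nothing is stolen. With more workers the round count depends on the random victim choices, so no figure is quoted for that case.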

This simple work-stealing scheduler has been shown, 
both analytically and empirically, to deliver efficient 
and predictable performance [6, 8, 9] for “well struc- 
tured” computations. A well structured computation 
is one in which each procedure sends values (with 
send_argument) only to its parent and only as the last 
action performed by its last thread. For well structured 
computations executing on any number P of processors, 
the execution time can be modeled accurately as $T_1/P + T_\infty$,
where $T_1$ denotes the work of the computation (that is, the
execution time with 1 processor) and $T_\infty$ denotes the
critical-path length (that is, the theoretical execution time on an
ideal machine with infinitely many processors). Such performance is
within a factor of 2 of optimal, and additionally, when the critical
path is short compared to the amount of work per processor, such
performance displays linear speedup.
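
The performance model lends itself to a quick numerical check. The sketch below evaluates the modeled time and compares it with the lower bound max(T_1/P, T_inf) that any schedule must respect, which is where the factor of 2 comes from. The function names and the sample numbers (1000 seconds of work, a 1-second critical path, 50 processors) are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def predicted_time(work, critical_path, processors):
    """Modeled execution time: T_1 / P + T_inf."""
    return work / processors + critical_path


def lower_bound(work, critical_path, processors):
    """No schedule on P processors can finish faster than either term alone."""
    return max(work / processors, critical_path)


# Hypothetical job: T_1 = 1000 s of work, T_inf = 1 s critical path, P = 50.
t_model = predicted_time(1000.0, 1.0, 50)
t_best = lower_bound(1000.0, 1.0, 50)
print(t_model, t_best)  # prints 21.0 20.0
```

Because a sum of two nonnegative terms is at most twice the larger one, the modeled time never exceeds twice the lower bound. When T_inf is small next to T_1/P, as in this example, the modeled time is dominated by T_1/P, which is the linear-speedup regime the text describes.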

The key element in proving this $T_1/P + T_\infty$ performance
bound is the fact that closures are always stolen from the tail of
the ready deque. For well structured computations, a closure that is
on the critical path must be at the tail of some processor’s ready
deque. Thus, when processors are not executing closures, they are
stealing work and, therefore, are likely to be making progress on
the critical path. As a corollary to this result, the number of
work-steal attempts per processor is proportional to the
critical-path length and does not grow with the work. Thus, a
computation with a sufficiently short critical path compared to the
work per processor can continue to display linear speedup even when
communication is very expensive. This idea of amortizing overhead
against the critical path plays an important role in our later
discussion of adaptive parallelism and fault tolerance.


3  Cilk-NOW job architecture 


The Cilk-NOW runtime system consists of several com- 
ponent programs that (in addition to macroscheduling 
duties discussed later) manage the execution of each in- 
dividual Cilk program. In this section, we shall cover the 
architecture of a Cilk program as it is executed by the 
Cilk-NOW runtime system, explaining the operation of 
each component and their interactions. 

In Cilk-NOW terminology, we refer to an executing 
Cilk program as a Cilk job. Since Cilk programs are par- 
allel programs, a Cilk job consists of several processes 
running on several machines. One process, called the 
clearinghouse, in each Cilk job runs a system-supplied 
program called CilkChouse that is responsible for
keeping track of all the other processes that comprise
a given job. These other processes are called workers. 
A worker is a process running the actual executable of a 
Cilk program. Since Cilk jobs are adaptively parallel, the 
set of workers is dynamic. At any given time during the 
execution of a job, a new worker may join the job or an 
existing worker may leave. Thus, each Cilk job consists 
of one or more workers and a clearinghouse to keep track 
of them. 

The Cilk-NOW runtime system contains additional 
components that perform macroscheduling as discussed 
in Section 6, but for the purpose of our present discus- 
sion, we need only introduce the “node managers.” A 
node manager is a process running a system-supplied 
program called CilkNodeManager. A node manager 
runs as a background daemon on every machine in the 
network. It continually monitors its machine to deter- 
mine when the machine is idle. 

To see how all of these components work together 
in managing the execution of a Cilk job, we shall run 
through an example. (In describing interactions with 
the macroscheduler, we shall refer to the macrosched- 
uler as a single entity, though actually, as we shall see 
in Section 6, the macroscheduler is a distributed subsys- 
tem with several components.) Suppose that a user sits 
down at a machine called Penguin to run the pfold 
program. In our example, the user types 


pfold 3 7 


at the shell, thereby launching a Cilk job to enumerate 
all protein foldings using 3 initial folding sequences and 
starting with the 7th one. 

The new Cilk job begins execution as illustrated in 
Figure 4(a). The new process running the pfold exe- 
cutable is the first worker and begins execution by fork- 
ing a clearinghouse with the command line 


CilkChouse -- pfold 3 7. 


Thus, the clearinghouse knows that it is in charge of a job 
whose workers are running “pfold 3 7.” The clear- 
inghouse begins execution by sending a job description 
to the macroscheduler. The job description is a record 
containing several fields. Among these fields is the name 
of the Cilk program executable (in this case, pfold)
and the clearinghouse’s network address. The clearing- 
house then goes into a service loop waiting for messages 
from its workers. After forking the clearinghouse, the 
first worker registers with the clearinghouse by sending it 
a message containing its own network address. Now the 
clearinghouse knows about one worker, and it responds 
to that worker by assigning it a unique name. Workers 
are named with numbers, starting with number 0. Hav- 
ing registered, worker 0 begins executing the Cilk pro- 
gram as described in Section 2. We now have a running 
Cilk job with one worker.

Figure 4: (a) The first worker forks a clearinghouse, and then the
clearinghouse submits the job to the macroscheduler. (b) When the
node manager detects that its machine is idle, it obtains a job from
the macroscheduler and then forks a worker. The worker registers
with the clearinghouse and then begins work stealing. (c) When the
node manager detects that its machine is no longer idle, it sends a
kill signal to the worker. The worker catches this signal, offloads
its work to other workers, unregisters with the clearinghouse, and
then terminates.


A second worker joins the Cilk job when some other 
workstation in the network discovers that it is idle, as 
illustrated in Figure 4(b). Suppose the node manager 
on a machine named Sparrow detects that the ma- 
chine is idle. The node manager sends a message to the 
macroscheduler, and the macroscheduler responds with 
the job description of a Cilk job for the machine to work 
on. In this case, the job description specifies our pfold
job by giving the name of the executable (pfold) and
the network address of the clearinghouse. The node man- 
ager then uses this information to fork a new worker as a 
child with the command line 


pfold -NoChouse
-Address=clearinghouse-address 


The -NoChouse flag on the command line tells the worker that it is
to be an additional worker in an already existing Cilk job. (Without
this flag, the worker would fork a new clearinghouse and start a new
Cilk job.) The -Address field on the command line tells the worker
where in the network to find the clearinghouse. The worker uses this
address to send a registration message, containing its own network
address, to the clearinghouse. The clearinghouse responds with the
worker’s assigned name (in this case, number 1) and the job’s
command-line arguments (in this case, “pfold 3 7”). Additionally,
the clearinghouse responds with a list of the network addresses of
all other registered workers. Now the new worker knows the addresses
of the other workers, so it can commence execution of the Cilk
program and steal work as described in Section 2. We now have a
running Cilk job with two workers.

Now, suppose that someone touches the keyboard on
Sparrow. In this case, the node manager detects that 
the machine is busy, and the machine leaves the Cilk job 
as illustrated in Figure 4(c). After detecting that the ma- 
chine is busy, the node manager sends a kill signal to its 
child worker. The worker catches this signal and pre- 
pares to leave the job. First, the worker offloads all of its 
closures to other workers as explained in more detail in 
Section 4. Next, the worker sends a message to the clear- 
inghouse to unregister. Finally, the worker terminates. 
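
The clearinghouse's bookkeeping for workers that register and unregister can be pictured with the following sketch. It models only what the text states: names are numbers starting at 0, a registration reply carries the assigned name, the job's command line, and the addresses of the other registered workers. The class name, the method names, the reply layout, and the `host:port` address strings are all assumptions made for illustration; the real CilkChouse exchanges these as network messages.

```python
class Clearinghouse:
    """Toy model of the clearinghouse's worker table."""

    def __init__(self, command_line):
        self.command_line = command_line  # e.g. "pfold 3 7"
        self.workers = {}                 # worker name -> network address
        self.next_name = 0                # workers are named 0, 1, 2, ...

    def register(self, address):
        """Register a worker and build the reply it would receive."""
        name = self.next_name
        self.next_name += 1
        peers = dict(self.workers)        # all other registered workers
        self.workers[name] = address
        return {"name": name, "command_line": self.command_line, "peers": peers}

    def unregister(self, name):
        """Forget a worker that has offloaded its closures and is leaving."""
        self.workers.pop(name, None)


chouse = Clearinghouse("pfold 3 7")
first = chouse.register("penguin:4000")    # hypothetical address of worker 0
second = chouse.register("sparrow:4000")   # hypothetical address of worker 1
print(second["name"], second["peers"])     # prints 1 {0: 'penguin:4000'}
chouse.unregister(second["name"])          # Sparrow's owner has returned
```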

When a Cilk job is running, each worker periodi- 
cally checks in with the clearinghouse. Specifically, each 
worker periodically (every 2 seconds) sends a message to 
the clearinghouse, and the clearinghouse responds with 
an update message informing the worker of any other 
workers that have left the job and any new workers that 
have joined the job. For each new worker that has joined, 
the clearinghouse also provides the network address. If 
the clearinghouse does not receive any messages from 
a given worker for an extended period of time (30 sec- 
onds), then the clearinghouse determines that the worker 
has crashed. In later update messages, the clearinghouse 
informs the other workers of the crash, and the other 
workers take appropriate remedial action as described in 
Section 5. 
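
The check-in scheme above amounts to a heartbeat with a timeout. The sketch below shows one way the clearinghouse side could track it, using the two intervals given in the text (2 seconds between check-ins, 30 seconds of silence before a worker is presumed crashed). The class `Membership`, its methods, and the event tuples are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
CHECK_IN_INTERVAL = 2.0   # seconds between a worker's check-in messages
CRASH_TIMEOUT = 30.0      # silence after which a worker is presumed crashed


class Membership:
    """Hypothetical clearinghouse-side record of which workers are alive."""

    def __init__(self):
        self.last_seen = {}  # worker name -> time of its last message
        self.events = []     # ordered log of (kind, name, address) tuples
        self.cursor = {}     # worker name -> index of the first unreported event

    def join(self, name, address, now):
        self.last_seen[name] = now
        self.events.append(("joined", name, address))
        self.cursor[name] = len(self.events)  # a new worker skips older history

    def detect_crashes(self, now):
        for name, seen in list(self.last_seen.items()):
            if now - seen > CRASH_TIMEOUT:
                del self.last_seen[name]
                del self.cursor[name]
                self.events.append(("crashed", name, None))

    def check_in(self, name, now):
        """Record a heartbeat; return the update message for this worker."""
        if name not in self.cursor:
            return None  # unknown worker, or one already presumed crashed
        self.last_seen[name] = now
        self.detect_crashes(now)
        update = self.events[self.cursor[name]:]
        self.cursor[name] = len(self.events)
        return update


m = Membership()
m.join(0, "penguin", now=0.0)
m.join(1, "sparrow", now=1.0)
print(m.check_in(0, now=2.0))   # prints [('joined', 1, 'sparrow')]
print(m.check_in(0, now=32.0))  # prints [('crashed', 1, None)]
```

In the second call, worker 1 has been silent for 31 seconds, so worker 0's update reports the crash.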

All communication between workers, and between workers and the
clearinghouse, is implemented with UDP/IP [13, 40]. Knowing that UDP
datagrams are unreliable, the Cilk-NOW protocols incorporate
appropriate mechanisms, such as acknowledgments, retries, and
timeouts, to ensure correct operation when messages get lost. We
shall not discuss these mechanisms in any detail, and in order to
simplify our exposition of Cilk-NOW, we shall often speak of
messages being sent and received as if they are reliable. What we
will say about these mechanisms is that they are built on top of UDP
but without any effort to create a reliable message-passing layer.
Rather, these mechanisms are built directly into the runtime
system’s protocols, so in the common case when a message does get
through, Cilk-NOW pays no overhead to make the message reliable.
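
The paper does not spell out these mechanisms, so the following is only one plausible shape for a retry loop built into a request/reply exchange. The reply itself serves as the acknowledgment, so an exchange that succeeds on the first try sends no extra messages. The function `request_with_retry`, its parameters, and the `LossyPeer` test double are all hypothetical; a real implementation would send datagrams over a UDP socket rather than call an in-memory object.

```python
def request_with_retry(send, receive, request, max_attempts=5, timeout=0.5):
    """Resend `request` until a reply arrives or the attempts run out.

    `send(request)` transmits one datagram, which may be lost.
    `receive(timeout)` returns the reply, or None if the timeout expired.
    Returns (reply, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        send(request)
        reply = receive(timeout)
        if reply is not None:
            return reply, attempt
    raise TimeoutError(f"no reply after {max_attempts} attempts")


class LossyPeer:
    """In-memory stand-in for a remote worker; drops the first `drops` requests."""

    def __init__(self, drops):
        self.drops = drops
        self._pending = None

    def send(self, request):
        if self.drops > 0:
            self.drops -= 1                      # datagram lost in transit
        else:
            self._pending = ("reply", request)   # peer answers the request

    def receive(self, timeout):
        reply, self._pending = self._pending, None
        return reply


peer = LossyPeer(drops=2)
print(request_with_retry(peer.send, peer.receive, "steal"))  # prints (('reply', 'steal'), 3)
```

If every attempt is lost, the caller gets a `TimeoutError`, which is the point at which a runtime system would treat the peer as unreachable and take remedial action.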

We chose to build Cilk-NOW’s communication pro- 
tocols using an unreliable message-passing layer instead 
of a reliable one for two reasons, both based on cnd-to- 
end design arguments [38]. First, reliable layers such as 
TCP/IP {40} and PVM [41] perform implicit acknowl- 
edgments and retries to achieve reliability. Therefore, 
such layers either preclude the use of asynchronous com- 
munication or require extra buffering and copying. A 
layer such as UDP which provides minimal service guar- 
antees can be implemented with considerably less soft- 
ware overhead than a layer with more service features. 
In the common case when the additional service is not 
needed, the minimal layer can easily outperform its fully- 


featured counterpart. Second, in an environment where 
machines can crash and networks can break, the notion 
of a “reliable” message-passing layer is somewhat sus- 
pect. A runtime system operating in an inherently un- 
reliable environment cannot expect the message-passing 
layer to make the environment reliable. Rather, the run- 
time system must incorporate appropriate mechanisms 
into its protocols to take action when a communication 
endpoint or link fails. For these reasons, we chose to 
build the Cilk-NOW runtime system on top of a minimal 
layer of message-passing service and incorporate mech- 
anisms directly into the runtime system’s protocols in or- 
der to handle issues of reliability. The downside to this 
approach is complexity. The protocols implemented in 
the Cilk-NOW runtime system are complex: the code 
for these protocols takes almost 20 percent of the total 
runtime-system code, and the programming effort was 
probably near half of the total. Nevertheless, this was 
a one-time effort that we expect will reap performance 
rewards for a long time to come. 
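
The idea of folding reliability into the protocol itself, rather than into a separate message layer, can be sketched as a request that is simply resent when no reply arrives. This is only an illustration: the LossyChannel class, the request_with_retry function, and the fixed retry limit are hypothetical stand-ins and are not taken from the Cilk-NOW implementation.

```python
from typing import Callable, Optional


class LossyChannel:
    """Hypothetical stand-in for UDP: silently drops the first `drops` requests."""

    def __init__(self, handler: Callable[[str], str], drops: int) -> None:
        self.handler = handler
        self.drops = drops
        self.attempts = 0

    def send(self, request: str) -> Optional[str]:
        self.attempts += 1
        if self.attempts <= self.drops:
            return None  # message lost: no reply will ever arrive
        return self.handler(request)


def request_with_retry(channel: LossyChannel, request: str, max_tries: int = 5) -> str:
    """Resend the request until a reply arrives.

    In the common case the first send succeeds, so nothing extra is paid
    for reliability. The retry limit is an assumption of this sketch.
    """
    for _ in range(max_tries):
        reply = channel.send(request)
        if reply is not None:
            return reply
    # The peer or link is presumed failed; the protocol must now react.
    raise TimeoutError("no reply after %d tries" % max_tries)
```

When the retries are exhausted, the caller is expected to treat the peer as failed, which is the case the runtime-system protocols have to handle anyway.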


4 Adaptive parallelism 


Adaptive parallelism allows a Cilk job to take advan- 
tage of idle machines whether or not they are idle when 
the job starts and whether or not they will remain idle 
for the duration of the job. In order to efficiently uti- 
lize machines that may join and leave a running job, the 
overhead of supporting this feature must not excessively 
slow down the work of any worker at a time when it is 
not joining or leaving. As we saw in the previous sec- 
tion, a new worker joins a job easily enough by register- 
ing with the clearinghouse and then stealing a closure. 
A worker leaves a job by migrating all of its closures 
to other workers, and here the danger lies. When we 
migrate a waiting closure, other closures with continu- 
ations that refer to this closure must somehow update 
these continuations so they can find the waiting closure 
at its new location. (Without adaptive parallelism, wait- 
ing closures never move.) Naively, each migrated wait- 
ing closure would have to inform every other closure of 
its new location. In this section, we show how we can 
take advantage of Cilk’s well structuring and the work- 
stealing scheduler to make this migration extremely sim- 
ple and efficient. (Experimental results documenting the 
efficiency of Cilk-NOW’s adaptive parallelism have been 
omitted for lack of space but can be found in [6].) 

Our approach is to impose additional structure on the 
organization of closures and continuations, such that the 
structure is cheap to maintain while simplifying the mi- 
gration of closures. Specifically, we maintain closures 
in “subcomputations” that migrate en masse, and every 
continuation in a closure refers to a closure in the same 
subcomputation. In order to send a value from a clo-
sure in one subcomputation to a closure in another, we
forward the value through intermediate “result closures,” 
and give each result closure the ability to send the value 
to precisely one other closure in one other subcomputa- 
tion. With this structure and these mechanisms, all of the 
overhead associated with adaptive parallelism (other than 
the actual migration of closures) occurs only when clo- 
sures are stolen, and as we saw in Section 2, the number 
of steals grows at most linearly with the critical path of 
the computation and is not a function of the work. The 
bulk of this section’s exposition concerns the organiza- 
tion of closures in subcomputations and the implemen- 
tation of continuations. After covering these topics, the 
mechanism by which closures are migrated to facilitate 
adaptive parallelism is quite straightforward. 


In Cilk-NOW, every closure is maintained in one of 
three pools associated with a data structure called a sub- 
computation. A subcomputation is a record contain- 
ing (among other things) three pools of closures. The 
ready pool is the list of ready closures described in 
Section 2. The waiting pool is a list of waiting clo- 
sures. The assigned pool is a list of ready closures 
that have been stolen away. Program execution begins 
with one subcomputation—the root subcomputation— 
allocated by worker 0 and containing a single closure— 
the initial thread of cilk_main—in the ready pool. In
general, a subcomputation with any closures in its ready 
pool is said to be ready, and ready subcomputations can 
be executed by the scheduler as described in Section 2 
with the additional provision that each waiting closure is 
kept in the waiting pool and then moved to the ready pool 
when its join counter decrements to zero. The assigned 
pool is used in work stealing as we shall now see. 
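
A minimal sketch of the subcomputation record may help fix the terminology. The class and method names below are hypothetical; only the three pools and the join-counter rule come from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # closures are compared by identity, not by value
class Closure:
    thread: str
    join_counter: int = 0  # number of arguments still missing


@dataclass
class Subcomputation:
    name: str
    ready: List[Closure] = field(default_factory=list)     # runnable closures
    waiting: List[Closure] = field(default_factory=list)   # missing arguments
    assigned: List[Closure] = field(default_factory=list)  # stolen away

    def add(self, closure: Closure) -> None:
        pool = self.ready if closure.join_counter == 0 else self.waiting
        pool.append(closure)

    def deliver_argument(self, closure: Closure) -> None:
        """One missing argument arrived; move to ready when none remain."""
        closure.join_counter -= 1
        if closure.join_counter == 0:
            self.waiting.remove(closure)
            self.ready.append(closure)

    def is_ready(self) -> bool:
        return bool(self.ready)
```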


The act of work stealing creates a new subcomputation 
on the thief which is linked to a copy of the stolen clo- 
sure kept in an assigned pool on the victim. If a worker 
needs to steal work, then before sending a steal request to 
a victim, it allocates a new subcomputation from a sim- 
ple runtime heap and gives the subcomputation a unique 
name. The subcomputation’s name is formed by con- 
catenating the thief worker’s name and a number unique 
to that worker. The first subcomputation allocated by a 
worker r is named r:1, the second is named r:2, and
so on. The root subcomputation is named 0:1. The steal
request message contains the name of the thief’s newly 
allocated subcomputation. When the victim worker gets 
the request message, if it has any ready subcomputations,
then it chooses a ready subcomputation in round-robin 
fashion, removes the closure at the tail of the subcom- 
putation’s ready pool, and places this victim closure in 
the assigned pool. As illustrated in Figure 5, the victim 
worker then assigns the closure to the thief’s subcompu- 
tation by adding to the closure an assignment information 
record allocated from a simple runtime heap, and then
storing the name of the thief worker and the name of the
thief’s subcomputation (as contained in the steal request 
message) in the assignment information. Finally, the vic- 
tim worker sends a copy of the closure to the thief. When 
the thief receives the stolen closure, it records the name 
of the victim worker in its subcomputation, and it places 
the closure in the subcomputation’s ready pool. Now the 
thief’s subcomputation is ready, and the thief worker may 
commence executing it. Notice that the victim closure 
and thief subcomputation can refer to each other via the 
thief subcomputation’s name which is stored both in the 
victim closure’s assignment information and in the thief 
subcomputation, as illustrated in Figure 5. 
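
The steal protocol just described can be sketched as follows. All class and function names are hypothetical, messages are modeled as direct method calls, and the round-robin choice among ready subcomputations is simplified to "first ready one found". What happens to the thief's freshly allocated subcomputation when the steal fails is not stated in the text; discarding it is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(eq=False)
class Closure:
    thread: str
    # Assignment information: (thief worker name, thief subcomputation name).
    assignment: Optional[Tuple[str, str]] = None


@dataclass
class Subcomputation:
    name: str
    victim_worker: Optional[str] = None
    ready: List[Closure] = field(default_factory=list)
    assigned: List[Closure] = field(default_factory=list)


class Worker:
    def __init__(self, name: str) -> None:
        self.name = name
        self.counter = 0
        self.subs: Dict[str, Subcomputation] = {}

    def new_subcomputation(self) -> Subcomputation:
        self.counter += 1  # worker r allocates r:1, r:2, ...
        sub = Subcomputation("%s:%d" % (self.name, self.counter))
        self.subs[sub.name] = sub
        return sub

    def handle_steal_request(self, thief: str, thief_sub: str) -> Optional[Closure]:
        for sub in self.subs.values():  # simplified: not truly round-robin
            if sub.ready:
                victim = sub.ready.pop()  # closure at the tail of the ready pool
                victim.assignment = (thief, thief_sub)
                sub.assigned.append(victim)
                return Closure(victim.thread)  # a copy goes to the thief
        return None


def steal(thief: Worker, victim: Worker) -> Optional[Subcomputation]:
    sub = thief.new_subcomputation()  # allocated before the request is sent
    stolen = victim.handle_steal_request(thief.name, sub.name)
    if stolen is None:
        del thief.subs[sub.name]  # assumption: discard on a failed steal
        return None
    sub.victim_worker = victim.name
    sub.ready.append(stolen)
    return sub
```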

When a worker finishes executing a subcomputation, 
the link between the subcomputation and its victim clo- 
sure is destroyed. Specifically, when a subcomputation 
has no closures in any of its three pools, then the subcom- 
putation is finished. A worker with a finished subcompu- 
tation sends a message containing the subcomputation’s 
name to the subcomputation’s victim worker. Using this 
name, the victim worker finds the victim closure. This 
closure is removed from its subcomputation’s assigned 
pool and then the closure and its assignment informa- 
tion are freed. The victim worker then acknowledges 
the message, and when the thief worker receives the ac- 
knowledgment, it frees its subcomputation. When the 
root subcomputation is finished, the entire Cilk job is fin-
ished.


In addition to allocating a new subcomputation, when-
ever a worker steals a closure, it also allocates a new “re-
sult” closure, and it alters the continuation in the stolen
closure so that it refers to the result closure. Consider a 
thief stealing a closure, and suppose the victim closure 
contains a continuation referring to a closure that we call 
the target. (The victim and target closures must be in 
the same subcomputation in the victim worker.) Con- 
tinuations are implemented as the address of the target 
closure concatenated with the index of an argument slot 
in the target closure. Therefore, the continuation in the 
victim closure contains the address of the target closure, 
and this address is only meaningful to the victim worker. 
Thus, when the thief worker receives the stolen closure, 
it replaces the continuation with a new continuation re- 
ferring to an empty slot in a newly allocated result clo- 
sure. The stolen and result closures are part of the same 
subcomputation. The result closure’s thread is a special 
system thread whose operation we shall explain shortly. 
This thread takes one argument: a result value. The re- 
sult value is initially missing, and the continuation in the 
stolen closure is set to refer to this argument slot. The 
result closure is waiting and its join counter is 1. 

Figure 5: A victim closure stolen from the subcomputation s:i of victim worker s is assigned to the thief subcomputation r:j.
The victim closure is placed in the assigned pool and augmented with assignment information that records the name of the thief
worker and the name of the thief subcomputation. The thief subcomputation records its own name and the name of the victim
worker. Thus, the victim closure and thief subcomputation can refer to each other via the thief subcomputation's name.

Using continuations to send values from one thread to
another operates as described in Section 2, but when a
value is sent to a result closure, communication between
different subcomputations occurs. When a result closure
receives its result value, it becomes ready, and when its
thread executes, it forwards the result value to another
closure in another subcomputation as follows. When a
worker executing a subcomputation executes a result clo-
sure’s thread, it sends a message to the subcomputation’s
victim worker. This message contains the subcomputa-
tion’s name as well as the result value that is the thread’s
argument. When the victim worker receives this mes-
sage, it uses the subcomputation name to find the victim
closure, and then it uses the continuation in the victim
closure to send the result value to the target.
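
The forwarding path through a result closure can be sketched in a few lines. The names below (MISSING, send_argument, make_result_closure, and so on) are hypothetical, and the message to the victim worker is modeled as a plain tuple rather than a network message.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple

MISSING = object()  # marks an argument slot whose value has not arrived


@dataclass(eq=False)
class Closure:
    thread: str
    args: List[Any]
    # A continuation is (target closure, index of an argument slot).
    cont: Optional[Tuple["Closure", int]] = None

    def join_counter(self) -> int:
        return sum(1 for a in self.args if a is MISSING)


def send_argument(cont: Tuple[Closure, int], value: Any) -> None:
    target, slot = cont
    target.args[slot] = value


def make_result_closure(stolen: Closure) -> Closure:
    """On the thief: redirect the stolen closure's continuation."""
    result = Closure("result", [MISSING])  # waiting, join counter 1
    stolen.cont = (result, 0)
    return result


def run_result_thread(sub_name: str, result: Closure) -> Tuple[str, Any]:
    """On the thief: build the message sent to the victim worker."""
    return (sub_name, result.args[0])


def victim_receive(assigned: Dict[str, Closure], message: Tuple[str, Any]) -> None:
    """On the victim: find the victim closure by name, then use its continuation."""
    sub_name, value = message
    send_argument(assigned[sub_name].cont, value)
```

Here `assigned` maps a thief subcomputation's name to its victim closure, which is the lookup the assignment information makes possible.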


To summarize, each subcomputation contains a collec- 
tion of closures and every continuation in a closure refers 
to another closure in the same subcomputation. To send 
a value from a closure in one subcomputation to a clo- 
sure in another, the value must be forwarded through an 
intermediate result closure. 


With this structure, migrating a subcomputation from 
one worker to another is fairly straightforward. At the 
source worker, the entire subcomputation is “pickled” by 
giving each of the subcomputation’s closures a number 
and replacing each continuation’s pointer with the corre- 
sponding number. Then, after sending the closures to the 
destination worker, the destination worker reconstructs 
the subcomputation by reversing the pickling operation. 
The subcomputation keeps its name, so after a migration, 
the first component of the subcomputation name will be 
different than the name of the worker. When the sub- 
computation and all of its closures have been migrated to 
their destination worker, this worker sends a message to 
the subcomputation’s victim worker to inform the victim 
closure of its thief subcomputation’s new thief worker. 
Additionally, for each of the subcomputation’s assigned 
closures, it sends a message to the thief worker to inform 
the thief subcomputation of its victim closure’s new vic-
tim worker. Thus, all of the links between victim closures
and thief subcomputations are restored. 
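
Because every continuation points at a closure in the same subcomputation, the "pickling" step reduces to swapping pointers for indices. The sketch below illustrates this; the function names and the tuple layout of a pickled record are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(eq=False)
class Closure:
    thread: str
    cont: Optional[Tuple["Closure", int]] = None  # (target closure, slot)


Record = Tuple[str, Optional[Tuple[int, int]]]


def pickle_subcomputation(closures: List[Closure]) -> List[Record]:
    """Number the closures and replace each continuation pointer by a number."""
    number = {id(c): i for i, c in enumerate(closures)}
    records = []
    for c in closures:
        cont = None if c.cont is None else (number[id(c.cont[0])], c.cont[1])
        records.append((c.thread, cont))
    return records


def unpickle_subcomputation(records: List[Record]) -> List[Closure]:
    """Reverse the pickling at the destination worker."""
    closures = [Closure(thread) for thread, _ in records]
    for closure, (_, cont) in zip(closures, records):
        if cont is not None:
            closure.cont = (closures[cont[0]], cont[1])
    return closures
```

The lookup in `number` would fail for a continuation that leaves the subcomputation, which is exactly the case the result closures rule out.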


5 Fault tolerance 


With transparent fault tolerance built into the Cilk-NOW 
runtime system, Cilk jobs may survive machine crashes 
or network outages despite the fact that Cilk programs 
are fault oblivious, having been coded with no special 
provision for handling machine or network failures. If 
a worker crashes, then other workers automatically redo 
any work that was lost in the crash. In the case of a more 
catastrophic failure, such as a power outage, a total net- 
work failure, or a crash of the file server, then all work- 
ers may crash. For this case, Cilk-NOW provides au- 
tomatic checkpointing, so when service is restored, the 
Cilk job may be restarted with minimal lost work. Recall 
that Cilk-NOW does not provide fault tolerance for I/O. 

In this section, we show how the structure used to sup- 
port adaptive parallelism—which leverages Cilk’s tree 
structure and the work-stealing scheduler—may be fur- 
ther leveraged to build these fault tolerant capabilities 
in Cilk-NOW. As with adaptive parallelism, all of the 
overhead associated with fault tolerance (other than the 
cost of periodic checkpoints) can be amortized against 
the number of steals which grows at most linearly with 
the critical path and is not a function of the work. 

Given adaptive parallelism, fault tolerance is only a 
short step away. With adaptive parallelism, a worker may
leave a Cilk job, but before doing so, it first migrates 
all of its subcomputations to other workers. In contrast, 
when a worker crashes, all of its subcomputations are 
lost. To support fault tolerance, we add a mechanism 
that allows surviving workers to redo any work that was 
done by the lost subcomputations. Such a mechanism 
must address two fundamental issues. First, not all work
is necessarily idempotent, so redoing work may present
problems. We address this issue with a technique that we 
call a return transaction. Specifically, we ensure that the 
work done by any given subcomputation does not affect 
the state of any other subcomputations until the given 
subcomputation finishes. Thus, from the point-of-view 
of any other subcomputation, the work of a subcompu- 
tation appears as a transaction: either the subcomputa- 
tion finishes and commits its work by making it visible 
to other subcomputations, or the subcomputation never 
happened. Second, the lost subcomputations may have 
done a large amount of work, and we would like to mini- 
mize the amount of work that needs to be redone. We ad- 
dress this issue by incorporating a transparent and fully 
distributed checkpointing facility. This checkpointing fa- 
cility also allows a Cilk job to be restarted in the case of 
a total system failure in which every worker crashes. 


To turn the work of a subcomputation into a return 
transaction, we modify the behavior of the subcompu- 
tation’s result closure. In Cilk, returning a value is al- 
ways the last operation performed by a Cilk procedure, 
so the result closure cannot be ready until the subcompu- 
tation is finished. In addition, recall that the execution of 
the result closure and the finishing of the subcomputation 
both warrant a message to the victim worker. Thus, we 
bundle these two messages into a single larger message 
sent to the victim worker. When the victim worker re- 
ceives this message, it commits all of the thief subcom- 
putation’s work by sending the appropriate result value 
from the victim closure, freeing the victim closure (and 
its assignment information), and sending an acknowledg- 
ment back to the thief worker. 


With subcomputations having this transactional na- 
ture, a Cilk job can tolerate individual worker crashes 
as follows. Suppose a worker crashes. Eventually, the 
clearinghouse will detect the crash, and the other living 
workers will learn of the crash at the next update from 
the clearinghouse. When a worker learns of a crash, it 
goes through all of its subcomputations, checking each 
assigned closure to see if it is assigned to the crashed 
worker. Each such closure is moved from the assigned 
pool back to the ready pool (and its assignment informa- 
tion is freed). Thus, all of the work done by the closure’s 
thief subcomputation which has been lost in the crash 
will eventually be redone. Additionally, when a worker 
learns of a crash, it goes through all of its subcomputa- 
tions to see if it has any that record the crashed worker as 
the subcomputation’s victim. For each such subcompu- 
tation, the worker aborts it as follows. The worker goes 
through all of the subcomputation’s assigned closures 
sending to each thief worker an abort message speci- 
fying the name of the thief subcomputation. Then the 
worker frees the subcomputation and all of its closures. 
When a worker receives an abort message, it finds the
thief subcomputation named in the message and recur-
sively aborts it. All of the work done by these aborted
subcomputations must eventually be redone. In order to 
avoid aborting all of these subcomputations (which may 
comprise the entire job in the case when the root subcom- 
putation is lost) and redoing potentially vast amounts of 
work, and in order to allow restarting when the entire job 
is lost, we need checkpointing. 
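
The two reactions to a crash, before checkpointing is added, can be sketched as below. The names are hypothetical, and abort messages are modeled as direct method calls through a shared dictionary of live workers instead of network messages.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(eq=False)
class Closure:
    thread: str
    assignment: Optional[Tuple[str, str]] = None  # (thief worker, thief subcomputation)


@dataclass
class Subcomputation:
    name: str
    victim_worker: Optional[str]
    ready: List[Closure] = field(default_factory=list)
    assigned: List[Closure] = field(default_factory=list)


class Worker:
    def __init__(self, name: str, cluster: Dict[str, "Worker"]) -> None:
        self.name = name
        self.subs: Dict[str, Subcomputation] = {}
        self.cluster = cluster  # live workers, keyed by name
        cluster[name] = self

    def abort(self, sub_name: str) -> None:
        """Free a subcomputation and recursively abort its thieves."""
        sub = self.subs.pop(sub_name, None)
        if sub is None:
            return
        for closure in sub.assigned:
            thief_worker, thief_sub = closure.assignment
            peer = self.cluster.get(thief_worker)
            if peer is not None:
                peer.abort(thief_sub)  # stands in for an abort message

    def on_crash(self, dead: str) -> None:
        for sub in list(self.subs.values()):
            if sub.name not in self.subs:
                continue  # already freed by a recursive abort
            if sub.victim_worker == dead:
                self.abort(sub.name)  # its results can never be returned
                continue
            for closure in [c for c in sub.assigned if c.assignment[0] == dead]:
                sub.assigned.remove(closure)
                closure.assignment = None  # free the assignment information
                sub.ready.append(closure)  # the lost work will be redone
```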

Cilk-NOW performs automatic checkpointing without 
any synchronization among different workers and with- 
out any notion of global state. Specifically, each sub- 
computation is periodically checkpointed to a file named 
with the subcomputation’s name. For example, a sub-
computation named r:i would be checkpointed to a file
named scomp_r_i. We assume that all workers in the
job have access to a common file system (through NFS or
AFS, for example), and all checkpoint files are written to
a common checkpoint directory.* To write a checkpoint
file for a subcomputation r:i, the worker first opens a
file named scomp_r_i.temp. Then, it writes the sub-
computation record and all of the closures—including
the assignment information for the assigned closures—
into the file. Finally, it atomically renames the file
scomp_r_i.temp to scomp_r_i, overwriting any pre-
vious checkpoint file. A checkpoint file can be read
to recover the subcomputation. On writing a check- 
point file, the worker additionally prunes any no-longer- 
needed checkpoint files. 
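
The write-then-rename step is what keeps a reader from ever seeing a half-written checkpoint. A minimal sketch follows, assuming file names in the scomp_0_1 style used for the root subcomputation. The JSON encoding, the function names, and the explicit fsync are assumptions of this sketch and say nothing about the actual Cilk-NOW file format.

```python
import json
import os


def checkpoint_path(directory: str, name: str) -> str:
    """Map a subcomputation name such as '0:1' to '<directory>/scomp_0_1'."""
    worker, index = name.split(":")
    return os.path.join(directory, "scomp_%s_%s" % (worker, index))


def write_checkpoint(directory: str, name: str, state: dict) -> str:
    final = checkpoint_path(directory, name)
    temp = final + ".temp"
    with open(temp, "w") as f:
        json.dump(state, f)    # subcomputation record plus all closures
        f.flush()
        os.fsync(f.fileno())   # assumption: force the data to disk first
    os.replace(temp, final)    # atomic rename over any previous checkpoint
    return final


def read_checkpoint(directory: str, name: str) -> dict:
    with open(checkpoint_path(directory, name)) as f:
        return json.load(f)
```

A crash before the rename leaves the previous checkpoint untouched; a crash after it leaves the new one complete.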

If workers crash, the lost subcomputations can be re- 
covered from checkpoint files. In the case of a single 
worker crash, the lost subcomputations can be recov- 
ered automatically. When a surviving worker finds that 
it has a subcomputation with a closure assigned to the 
crashed worker, then it can recover the thief subcom- 
putation by reading the checkpoint file. In the case of 
a large-scale failure in which every worker crashes, the 
Cilk job can be restarted from checkpoint files by setting 
the -Recover flag on the command line. Recovery be- 
gins with the root subcomputation whose checkpoint file 
is scomp_0_1. After recovering the root subcomputa- 
tion, then every other subcomputation can be recovered 
by recursively recovering the thief subcomputation for 
each of the root subcomputation’s assigned closures. 
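
The recursive recovery walk can be sketched as a traversal that starts at the root and follows the assignment information of each assigned closure. The dictionary layout of a recovered checkpoint and the `load` callback are hypothetical, and the sketch assumes every thief subcomputation has a checkpoint; the case of a missing file is not covered here.

```python
from typing import Callable, Dict


def recover(root_name: str, load: Callable[[str], dict]) -> Dict[str, dict]:
    """Recover `root_name` and, recursively, every thief subcomputation.

    `load(name)` is assumed to return the checkpointed state of the named
    subcomputation as a dict with an "assigned" list, where each entry
    records the name of its thief subcomputation.
    """
    recovered: Dict[str, dict] = {}

    def visit(name: str) -> None:
        state = load(name)
        recovered[name] = state
        for closure in state["assigned"]:
            visit(closure["thief_subcomputation"])

    visit(root_name)
    return recovered
```

A restart would call something like `recover("0:1", load)` with a `load` that reads the corresponding checkpoint file.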


6 Cilk-NOW macroscheduling 


The Cilk-NOW runtime system contains components 
that perform macroscheduling [30]. The macroscheduler 
identifies idle machines and determines which machines 
work on which jobs. In this section, we discuss each 
component of the macroscheduler, and we show how
they work together.


*We have not yet implemented any sort of distributed file system. In
the current implementation, workers implicitly synchronize when they
write checkpoint files, since they all access a common file system.

Like the workers and the clearinghouse, which to-
gether comprise a single parallel Cilk job, the com-
ponents of the macroscheduler are distributed across
the network. As already mentioned, each machine in
the network runs a node manager, an instance of the
CilkNodeManager program, that monitors the ma-
chine’s idleness and the status of a worker if one is
present. In addition to the clearinghouse, each Cilk 
job executes a single job manager, an instance of the 
CilkJobManager program, that services requests for 
the job description. Each of these components reg- 
isters with a central job broker, an instance of the 
CilkJobBroker program. The job broker keeps track 
of the set of node managers and job managers running 
in the network. The Cilk-NOW runtime system can con- 
tinue operation even if some of these components, in- 
cluding the job broker, fail. 


Each machine in the network runs a node manager 
that is responsible for determining when the machine is 
idle. When the machine is being used, the node manager 
wakes up every 5 seconds to determine if the machine has 
gone idle. It looks at how much time has elapsed since 
the keyboard and mouse have been touched, the number 
of users logged in, and the processor load averages. The
node manager then passes these values through a predi- 
cate to decide if the machine is idle. A typical predicate 
might require that the keyboard and mouse have not been 
touched for at least 5 minutes and the 1-minute processor 
load average is below 0.35. The predicate can be cus- 
tomized for each machine. We believe that maintaining 
the owner’s sovereignty is essential if we want owners to 
allow their machines to be used for parallel computation.
A user can change a predicate with a simple command- 
line utility called CilkPred. For example, issuing the 
command 


CilkPred user=lisiecki global add 
idletime=900 


causes any workstation on the network to require that the 
user “‘lisiecki” be idle for at least 900 seconds. Alterna- 
tively, a user might issue the command 


CilkPred always node=vulture add
load=0.2,0.2,0.2


which applies only to the workstation Vulture and re- 
quires it to have a load average of 0.2 or less for all of 
the 1, 5, and 15 minute load averages.
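
An idleness predicate of this kind is just a boolean function of the sampled values. The sketch below encodes the "typical" predicate and the load condition from the second example; the MachineStatus record and the function names are hypothetical and do not reflect how CilkPred stores its conditions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class MachineStatus:
    input_idle_seconds: float                   # since keyboard/mouse last touched
    users_logged_in: int
    load_averages: Tuple[float, float, float]   # 1, 5, and 15 minute averages


def typical_predicate(s: MachineStatus) -> bool:
    """Keyboard and mouse idle for at least 5 minutes, 1-minute load below 0.35."""
    return s.input_idle_seconds >= 5 * 60 and s.load_averages[0] < 0.35


def vulture_load_condition(s: MachineStatus) -> bool:
    """All three load averages must be 0.2 or less."""
    return all(load <= 0.2 for load in s.load_averages)
```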

When all applicable conditions of the predicate are 
satisfied, the machine is idle, and the node manager ob- 
tains a job description from a job manager (using an ad- 
dress given by the job broker or another node manager) 
and forks a worker. The node manager then monitors 
the worker and continues to monitor the machine’s idle-
ness. With a worker running, the node manager wakes
up once every second to determine if the machine is still
idle (adding an estimate of the running job’s processor 
usage to any processor load-average threshold). If the 
machine is no longer idle, then the node manager sends 
a kill signal to the worker as previously described. When
the worker process dies for any reason, the node manager 
takes one of two possible actions. If the machine is still 
idle, then it obtains a new job description and forks a new 
worker. If the machine is no longer idle, then it returns 
to monitoring the machine once every 5 seconds. 
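
The node manager's behavior amounts to a small decision table over two facts: whether a worker is running and whether the machine is idle. The sketch below is one hypothetical encoding; the action names are invented, and the one-second wait after sending the kill signal (while the worker is still shutting down) is an assumption.

```python
from typing import Tuple


def step(worker_running: bool, machine_idle: bool) -> Tuple[str, int]:
    """Return (action, seconds until the node manager wakes up again)."""
    if not worker_running:
        if machine_idle:
            return ("fork_worker", 1)  # obtain a job description, fork a worker
        return ("wait", 5)             # machine in use: poll every 5 seconds
    if machine_idle:
        return ("monitor", 1)          # worker running: poll every second
    return ("send_kill", 1)            # assumption: keep polling until it dies
```

Once the worker process has died, the next call sees `worker_running` as false and either forks a new worker or falls back to the 5-second poll.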

When a Cilk job begins execution, a job manager is 
started automatically by the clearinghouse. The clear- 
inghouse submits the job description to the job manager 
as alluded to in Section 3, and then the job manager reg- 
isters itself with the job broker. The job manager then 
goes to sleep, and it periodically wakes up to reregister 
with the job broker in case the job broker has crashed 
and restarted. When the job terminates, the job manager 
unregisters with the job broker. The job manager is the 
central authorizing agent for the job. Any time a node 
manager forks a worker, it receives a copy of the job de- 
scription directly from the job’s job manager. 


When a node manager forks a new worker, it must 
take special precautions that the user specified in the job 
description actually authorized the job to be run. Fail- 
ure to do so would allow an outsider to gain unautho- 
rized access to a user’s account. Furthermore, it is de- 
sirable for the macroscheduler’s protocols to be secure 
against unauthorized messages. For these reasons, all of 
the macroscheduler’s protocols are secured with an ab- 
straction on top of UDP called secure active messages 
[30]. This abstraction maintains all of the semantics of
the split-phase protocols mentioned earlier but adds a 
guarantee of the authenticity of messages to the receiver. 
Unlike a normal UDP message which is sent from one 
network address to another, a secure active message is 
sent between “principals.” A principal is a pair consist- 
ing of a network address and a claim as to the identity 
of the sender. Each secure active message contains user 
data, the sending principal, and whatever additional data 
might be required by the underlying authentication pro- 
tocol, whether that be the standard UNIX rsh protocol 
or a protocol like Kerberos [34]. The security layer is 
very simplistic, providing only enough functionality to 
allow the protocols to be secured in a manner indepen- 
dent of the authentication protocol. 

The decision to receive the job description directly 
from the job manager stems from security considerations 
with protocols like Kerberos, where the job broker and 
node managers are not trusted with the user’s credentials. 
Only the user in the possession of tickets can be trusted 
to start a remote process. Since the job manager runs 
as the desired user, retrieving the job description directly 
and securely from the job manager assures that the user
has actually authorized running the job.

The task of scheduling jobs on workstations is shared 
between the job broker and the node managers. The job 
broker is responsible for ensuring that each job is running 
on at least one workstation. The node managers then 
use a distributed, randomized algorithm to divide the 
workstations evenly among the jobs. Because the node 
managers are capable of performing scheduling, tempo- 
rary outages of the job broker do not impede progress 
in scheduling jobs on the network of workstations. We 
are currently experimenting with a distributed, random- 
ized macroscheduling algorithm that uses steal rates to 
estimate worker utilization. Each job should get its fair 
share of the idle machines, but no job should get more 
machines than it can efficiently utilize. 


7 Related work 


Cilk-NOW is unique in delivering adaptive and reliable 
execution for parallel programs on networks of worksta- 
tions. Traditionally, systems such as PVM [41], Tread- 
Marks [2], and others [11, 16, 23, 29] that are designed 
to support parallel programs on networks of workstations 
have not provided adaptive parallelism or fault tolerance. 
On the other hand, most systems that do provide support 
for adaptive execution or fault tolerance take a “process- 
centric” approach. That is, they provide an abstraction 
of mobile processes and/or an abstraction of reliable pro- 
cesses. As such these systems are very general in their 
potential application, but they do not provide much sup- 
port for parallel programs. In contrast, Cilk-NOW does 
provide support for parallel programs and it does pro- 
vide adaptive parallelism and fault tolerance, but it does 
so only for the Cilk parallel programming model. Such 
specificity allows the Cilk-NOW design to take an end- 
to-end approach [38] that leverages properties of the Cilk 
programming model in order to implement adaptive par- 
allelism and fault tolerance simply and efficiently. 

Distributed operating systems [17, 36, 43, 46] and re- 
mote execution facilities [18, 19, 31, 35, 47] provide 
services such as remote process execution and, in some 
cases, process migration. These systems are not intended 
to be parallel programming environments, though pre-
sumably a parallel programming environment could be
built atop one of these systems. In fact, Orca [42], which 
has been built on top of Amoeba, is such a system. These 
systems are process-centric in that they adapt only by re- 
motely executing and/or migrating processes. 

A small number of parallel programming and runtime 
systems have been built that are adaptively parallel, but 
unlike Cilk-NOW, none are fault tolerant. Possibly the 
first adaptively parallel system is the Benevolent Bandit 
Laboratory (BBL) [22], and Cilk-NOW borrows some 
of its overall system architecture from BBL. The Pi-
ranha system [24, 27], which is based on the Linda pro-
gramming model [12], is also adaptively parallel. (In
fact, the authors of Piranha appear to have coined the 
term “adaptively parallel.”) These systems support pro- 
gramming models that are quite different from Cilk’s, 
but as with the Cilk-NOW design, both leverage prop- 
erties of their programming model in order to imple- 
ment adaptive parallelism. A runtime system for the 
programming language COOL [14] running on symmet- 
ric multiprocessors [44] and cache-coherent, distributed, 
shared-memory machines uses process control to sup- 
port adaptive parallelism. This system relies on special- 
purpose operating system and hardware support. In con- 
trast, Cilk-NOW supports adaptive parallelism entirely 
in user-level software on top of commercial hardware 
and operating systems. The Spawn system [45] sup- 
ports concurrent applications with dynamic and adaptive 
resource management policies based on microeconomic 
principles. Unlike Cilk-NOW, none of these systems are 
fault tolerant. 


A growing number of systems do provide fault toler- 
ance, but unlike Cilk-NOW, none provide “application” 
fault tolerance in a high-level parallel programming en- 
vironment. The Hive [15] distributed operating system 
provides “system” fault tolerance, meaning that a fault 
in one component does not bring down the entire system. 
Hive does not, however, provide “application” fault tol- 
erance, meaning that with Hive, if an application is using 
a failed component, then the entire application crashes 
(unless the application itself has taken care to be fault 
tolerant). Application fault tolerance is provided by the 
Manetho system [21] via the technique of message log- 
ging. The Sam system [39] uses message logging to im- 
plement a fault tolerant distributed shared memory. The 
Sam implementation leverages properties of its shared- 
memory consistency model in order to avoid logging cer- 
tain messages. Unlike Cilk-NOW, both Manetho and 
Sam are process-centric, as they both provide the abstrac- 
tion of reliable processes, and neither is really a high- 
level parallel programming environment. 


In comparing Cilk-NOW with these other process- 
centric systems, an interesting question to ask is, why 
not build the Cilk-NOW runtime system on top of one 
of these other systems? After all, these systems already
implement adaptive and/or fault-tolerant execution. The 
answer is performance. As we have seen, the overhead 
of adaptive parallelism and fault tolerance in Cilk-NOW 
is amortized against the overhead in Cilk’s provably ef- 
ficient scheduling algorithm. This amortization is only 
possible because all facets of the design are specialized 
with high-level knowledge of algorithmic structure in the 
Cilk programming model.


8 Conclusion


The widespread use of NOWs for parallel computation 
requires a software infrastructure that allows program- 
mers to code in a high-level language that abstracts 
away the complexity of protocols, scheduling, and re- 
source management. Cilk and Cilk-NOW are part of 
this developing software infrastructure. In this paper, we 
have shown how Cilk-NOW’s end-to-end design lever- 
ages structure in the Cilk programming model to imple- 
ment adaptive parallelism and fault tolerance simply and 
efficiently. All overheads are amortized against work- 
stealing operations, and the number of steals grows with 
the critical path and not with the work. This result is only 
possible because Cilk-NOW incorporates Cilk-specific 
policies at all levels of its design. 

The Cilk-NOW runtime system, as described in this 
paper and as currently implemented, supports the Cilk-2 
language which is essentially functional in that it does 
not have support for a global address space or paral-
lel I/O. More recent incarnations of Cilk for MPPs
and SMPs have support for a global address space us- 
ing “dag-consistent” distributed shared memory [7], and 
we are currently working on extensions for parallel I/O. 
With these additions to Cilk, preserving Cilk-NOW’s 
adaptive and fault tolerant execution model remains a 
challenging open problem. We are currently working on 
this problem. The dag-consistency model was conceived 
with adaptive parallelism and fault tolerance in mind, 
and we are investigating the idea of coupling our current 
return transactions mechanism with a causal message- 
logging mechanism [1]. 

In other current research, we are investigating dis- 
tributed macroscheduling algorithms. The goal of such 
an algorithm is to assign idle workstations to Cilk jobs 
so that each job gets a “fair” share and without requir- 
ing that users explicitly state their application’s resource 
needs. It turns out that the parallelism of a Cilk job can be 
determined continuously and automatically by monitor- 
ing the steal rate. We are examining a macroscheduling 
algorithm in which this information is used in random- 
ized pairwise interactions among processors. The idea 
is that periodically (and asynchronously) each processor 
picks another processor in the network at random, and 
if the two processors are working on different jobs, then 
one processor may switch to the job of the other. The 
switching decision is randomized and based only on in- 
formation about the parallelism and size of the two jobs 
involved. Early simulation results indicate that such a 
scheme is very effective [30], and we are currently work- 
ing on analysis and implementation. 

More information about Cilk, including papers, documentation,
and software releases, but not including Cilk-NOW software,
can be found on the World-Wide Web at
http://theory.lces.mit.edu/~cilk.
Abstract


This paper describes the design and implementa- 
tion of a distributed shared memory facility we have 
implemented for the FreeBSD operating system (a 
descendant of 4.4BSD that runs on the PC architec- 
ture). Interesting aspects of the design are: (1) the 
consistency protocol uses unreliable datagram com- 
munication, but is robust with respect to message 
loss, and in the normal case requires only two data- 
grams to handle a read fault; (2) the facility provides 
a simple programming interface that does not require 
any socket or network programming to use; (3) the 
facility extends the FreeBSD VM system in a very 
non-intrusive way. 


1 Introduction 

A distributed shared memory (DSM) facility per- 
mits processes running at separate hosts on a net- 
work to share virtual memory in a transparent fash- 
ion, as if the processes were actually running on a 
single processor [LH89]. This is accomplished with
the help of the virtual memory (VM) subsystem, 
which identifies page faults on DSM pages and in- 
vokes the DSM subsystem to retrieve data over the 
network and perform necessary synchronization. 

This paper describes the design and implemen- 
tation of a distributed shared memory system we 
have built for the FreeBSD 2.1 operating system, a 
descendant of 4.4BSD that runs on the PC architec- 
ture. The following goals were important in shaping 
the design of our facility: 


• A simple client application interface, which
would be as close as possible to ordinary memory.


*Product names used in this publication are used for iden- 
tification purposes only and may be trademarks of their re- 
spective companies or organizations. 
• To make the basic DSM read page/write page
operations as efficient as possible in the normal
case.


• A nice fit with the existing FreeBSD VM system,
with minimal changes to existing FreeBSD
kernel code, and providing flexibility for experimentation
with different consistency protocols.


The programming model presented to client applications
centers around the notion of DSM objects,
which are used in a fashion analogous to the use of
memory-mapped files. No network or socket programming
is required of an application in order to
use the DSM facility. At the kernel level, the notion
of a DSM object fits together nicely with the
VM objects that already exist in the Mach-derived 
FreeBSD VM subsystem, enabling us to add the 
DSM facility to FreeBSD in a very non-intrusive 
fashion. 

To achieve efficiency and low communication 
complexity, we adopted as a basic design decision 
that unreliable datagram communication (our im- 
plementation uses UDP) should be used whenever 
possible. The protocol we designed, which is a write- 
invalidate protocol that ensures sequential consis- 
tency [Lam79], requires in the normal case only two 
datagrams (request and reply) to retrieve a copy of 
a page from a remote host, and a total of n+c+1 
datagrams (or an n-way multicast plus c + 1 indi- 
vidual datagrams) to obtain write permission on a 
copy of a page, where n is the number of hosts inter- 
ested in the object and c is the number of hosts that 
actually hold copies of the page. The protocol is ro- 
bust in the face of loss or reordering of datagrams, 
though in this case or in the case of contention for 
pages, additional messages may be required. Measurements
of basic latencies show that our read page
fault latencies are less than 3 ms, which is within 1.5 ms of
the best published results [BB93] we know of.
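
As a quick check on the message counts quoted above, the following sketch tabulates the normal-case cost of each operation. The function names are ours, chosen only for illustration; the formulas (two datagrams for a read, n + c + 1 for a write, or one multicast plus c + 1 individual datagrams) are the ones stated in the text.

```c
/* Normal-case datagram counts for the consistency protocol.
 * Function names are illustrative; the formulas come from the text. */

/* A read fault needs one request and one reply. */
int dsm_read_datagrams(void)
{
    return 2;
}

/* Obtaining write permission costs n + c + 1 datagrams, where n is the
 * number of hosts interested in the object and c is the number of
 * hosts that actually hold a copy of the page. */
int dsm_write_datagrams(int n, int c)
{
    return n + c + 1;
}

/* With an n-way multicast, the n datagrams collapse into a single
 * multicast, leaving c + 1 individual datagrams. */
int dsm_write_unicasts_with_multicast(int c)
{
    return c + 1;
}
```

For instance, with 8 interested hosts of which 3 hold copies, a write costs 12 datagrams without multicast. Loss, reordering, or contention can only add to these figures.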

The implementation of the DSM facility required 
only minimal changes to the existing FreeBSD ker-
nel code: only about 100 lines of additions or mod- 
ifications were made to previously existing kernel 
files. The rest of the system is split between about 
3000 lines of new kernel code and about 5000 lines 
for a user-level DSM server program. The user- 
mode server implements essentially all aspects of the 
consistency protocol, and interacts with the kernel 
through a narrow interface, thus allowing easy ex- 
perimentation with various consistency protocols. 

The remainder of the paper describes in more de- 
tail some of the more interesting aspects of our sys- 
tem. 


2 Architectural Overview 
2.1 Programming Model 


Our DSM facility centers around the concept of 
a DSM object, which is a virtual address space that 
consists of a sequence of shared pages. A process 
wishing to access a DSM object must first obtain 
for that object: (1) a UID, which uniquely iden- 
tifies that object among all other DSM objects in 
the world, and (2) the network address of a DSM 
server that knows about that object. A UID is 
obtained either by requesting the creation of a new 
DSM object, or else by receiving the UID of an exist- 
ing DSM object through some communication chan- 
nel outside the DSM facility. Network addresses are 
obtained by similar means. Once a process has the
UID of a DSM object and a corresponding server
address, it requests to attach to the object using a
system call provided for this purpose. After attaching
to the object, the process uses another system
call to map pages from the DSM object into its own 
virtual address space. When a process has finished 
accessing a DSM object, it asks to detach from the 
object; in response to this request the DSM facility 
deletes any existing mappings of that object. The at- 
tach/map/detach paradigm for DSM objects is anal- 
ogous to the open/map/close paradigm for memory- 
mapped files. It is also quite similar to what is pro- 
vided by the System V shm [ATT90] shared memory 
facility for interprocess communication. 
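
Since the DSM system calls exist only in our modified kernel, the analogy can be illustrated with the standard memory-mapped file calls instead. The sketch below is a minimal example of the open/map/close paradigm using POSIX open, mmap, munmap and close; the function name mapped_file_roundtrip and its parameters are made up for this illustration. The comments mark which step would correspond to attach, map, and detach for a DSM object.

```c
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Store msg through a shared file mapping and read it back into out.
 * Returns 0 on success, -1 on any error. Illustrative helper only. */
int mapped_file_roundtrip(const char *path, const char *msg,
                          char *out, size_t outlen)
{
    size_t len = strlen(msg) + 1;
    if (len > outlen)
        return -1;

    /* "open" step: the DSM analogue is attaching to an object. */
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        unlink(path);
        return -1;
    }

    /* "map" step: the DSM analogue is mapping pages of the object. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        unlink(path);
        return -1;
    }

    /* Ordinary memory references now reach the shared data. */
    memcpy(p, msg, len);
    memcpy(out, p, len);

    /* "close" step: the DSM analogue is detaching, which also removes
     * any mappings of the object. */
    munmap(p, len);
    close(fd);
    unlink(path);
    return 0;
}
```

The point of the comparison is that, after the map step, no further system calls are needed to read or write the shared data.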

Once a process has mapped DSM pages into its 
virtual address space, normal memory references to 
the mapped virtual addresses are used to access data 
in the DSM object. As usual, such memory ref- 
erences will cause a page fault if either the corre- 
sponding page is not resident in physical memory, 
or else the page does not have the appropriate ac- 
cess permissions set. When a page fault occurs for 
a virtual address that has been mapped to a DSM 
object, the kernel page fault handler dispatches a re- 
quest to the DSM subsystem. The DSM subsystem 
handles this request, communicating, if necessary,
with DSM servers elsewhere in the network either 
to obtain a copy of the page to be read, or else to 
synchronize with the other servers to ensure that a 
write operation can be performed on a page without 
violating data consistency guarantees. Once the re- 
quired communication and synchronization has been 
performed, the DSM subsystem responds to the page 
fault handler, and the faulting process is allowed to 
continue. 

An important feature of the above programming 
model is that processes using the DSM feature do 
not have to contain any code for communication over 
the network. The only aspect of network program- 
ming that shows through the interface is the net- 
work address required initially to attach to a DSM 
object; however, this network address can be treated
opaquely, as simply a string of bits that is passed to 
the kernel as an argument to the attach request. 


2.2 System Structure and Kernel Interfaces

The DSM subsystem has a client/server struc- 
ture, and contains both kernel and user-level com- 
ponents. The overall organization is depicted in Fig- 
ure 1. Clients are the user-level application processes 
that make use of the DSM facility. Arbitrarily many 
clients can run on a single host computer. To support
the DSM operations of the clients, a single DSM
server process runs on each host computer provid- 
ing DSM service. The DSM server is also a user- 
level process, though it is a privileged process that 
makes use of special DSM system calls provided by 
the kernel. The kernel portion of the DSM subsys- 
tem consists of (1) DSM pager code, which runs on 
behalf of a client process as a result of a page fault,
(2) client system calls, which allow clients to attach, 
map, and detach DSM objects as described above, 
and (3) server system calls, which provide the DSM 
server process with the access to the VM system it 
needs to carry out its function. 

As described above, the kernel page fault handler
invokes the DSM pager code in response to a
page fault by a client process involving a virtual ad- 
dress that has been mapped to a page in a DSM 
object. The DSM pager does not itself perform 
any communication or synchronization with remote 
DSM servers. Instead, it sends a request datagram 
to the local DSM server indicating the type of ser- 
vice that is required, and then sleeps awaiting a re- 
sponse. The local DSM server receives and handles 
this request datagram, possibly communicating with 
DSM servers elsewhere in the network as a result. 
When the required communication and synchroniza- 
tion has been performed, and any requested DSM 
Figure 1: General architecture of the DSM facility.


data is available in local physical memory, the DSM 
server uses a special system call to awaken the client 
process sleeping in the DSM pager. The special sys- 
tem call is used so that the DSM server can wake 
up clients directly in the kernel, instead of requiring 
every client application to contain code for receiving 
and interpreting reply datagrams from the server. 


To simplify the structure of the DSM server pro- 
gram, no attempt is made by the server to keep track 
of the status of client operations in progress and to 
ensure they succeed. Thus, it is possible that a re- 
quest sent by the local DSM server to a DSM server 
on a remote host might fail to elicit a response from 
the remote host; this failure in turn would mean that 
the local server might never respond to the client 
that issued the request. In such a situation, the onus 
is on the client to get things moving again: if the lo- 
cal DSM server fails to awaken the client process 
after a suitable interval, the client times out from 
the kernel sleep routine and resubmits the request 
to the server. 
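
The client-driven recovery just described can be sketched as a simple resubmission loop. Everything in the sketch is a stand-in: the lossy_server structure simulates a local DSM server that fails to answer some requests, and the max_tries bound is our assumption, since the text does not say whether the kernel ever gives up.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the local DSM server: it silently drops
 * the first `drops_left` requests and answers every later one. */
struct lossy_server {
    int drops_left;   /* requests still to be lost */
    int requests;     /* total requests received so far */
};

/* Submit one request and sleep. Returns true if the server awakens
 * the client, false if the sleep times out with no wakeup. */
static bool submit_and_sleep(struct lossy_server *srv)
{
    srv->requests++;
    if (srv->drops_left > 0) {
        srv->drops_left--;
        return false;
    }
    return true;
}

/* Client-side loop: resubmit the request after each timeout. Returns
 * the number of submissions that were needed, or -1 if max_tries
 * (an assumed bound) is exhausted. */
int dsm_request_with_retry(struct lossy_server *srv, int max_tries)
{
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        if (submit_and_sleep(srv))
            return attempt;
    }
    return -1;
}
```

Because the retry logic lives entirely on the client side, the server in this sketch keeps no record of operations in progress, which is the simplification the design is after.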


The client system call interface consists of the 
following system calls (Figure 2a): dsmcreate, 
dsmattach, dsmdetach, dsmmap and dsmwait. 


The dsmcreate call takes as an argument a size 
in bytes, and causes a new DSM object of that size 
to be created. The UID of the newly created object 
is returned. The dsmattach call takes as arguments 
the UID of a DSM object and the network address 
at which a server who knows about that object can 
be contacted, and it arranges for the calling process 
to become attached to the specified DSM object. 
The return value indicates success or failure. The 
dsmdetach call takes the UID of a DSM object as 
its single argument, and it causes the calling process 
to become detached from the specified object.

The dsmcreate, dsmattach, and dsmdetach system calls
are not actually executed in the context of the client 
process. Rather, they cause a request datagram to 
be sent to the local DSM server, who performs the 
requested service and returns a response to the client 
waiting in the kernel. 


The client dsmmap call is identical to the
previously existing mmap call, used to memory-map
files, except that dsmmap requires the UID of a DSM
object to be supplied instead of a file descriptor. In
spite of the overlap between dsmmap and mmap, we 
chose to keep them separate in the current imple- 
mentation to avoid modifications to existing code. 
The dsmwait call is used by a client process to avoid 
expensive busy waiting on DSM data. It takes as 
arguments the UID of a DSM object and the offset 
of a particular byte in that object, and it causes the 
caller to sleep as long as it can be guaranteed that 
the data byte at that offset has not been changed. As 
soon as this guarantee can no longer be made (for ex- 
ample, if a remote host obtains write permission on 
the page containing the particular byte), the server 
awakes the client, which returns to user mode. 


The server system call interface consists of the 
following system calls (see Figure 2b): dsmservice, 
dsmcreate, dsmdelete, dsmrespond, dsminvalid, 
dsmwritepage, dsmsendpage, and dsmrecvpage. 
The dsmservice call is used by the DSM server pro- 
cess on startup to identify itself to the kernel, to pre- 
vent any other DSM server processes from starting, 
and to enable access to the remaining server calls. 
The dsmcreate and dsmdelete calls are used to inform
the kernel of the creation and deletion of a DSM
object. The dsmrespond call is used by the server 
to wake up a client process sleeping in the kernel 
while awaiting DSM service. The DSM server uses 
int dsmcreate(int size);
int dsmattach(int objid, struct sockaddr *addr, int len);
int dsmdetach(int objid);
caddr_t dsmmap(caddr_t addr, int len, int prot,
               int flags, int objid, off_t offset);
int dsmwait(int objid, off_t offset);


a) Client system call interface. 


int dsmservice(int socket);
int dsmcreate(int size);
int dsmdelete(int objid);
int dsmrespond(int responseid, int result);
int dsmwritepage(int objid, off_t offset);
int dsminvalid(int objid, off_t offset);
int dsmsendpage(int objid, off_t offset, dsm_data_packet_t dpkt,
                struct sockaddr *addr, int len);
int dsmrecvpage(int objid, off_t offset);


b) Server system call interface. 


Figure 2: System call interface. 


the dsmwritepage call to tell the kernel that it is 
safe to write enable a particular page of DSM data. 
The dsminvalid call causes the kernel to invalidate 
any copies it may have of a particular page of DSM
data, so that subsequent attempts by clients to access
that page will fault. Finally, the dsmsendpage
and dsmrecvpage calls are used by the server to send
and receive a page of DSM data over the network.
System calls are provided for this in order to enable 
the transmission and reception of DSM data directly 
from or to the appropriate page of physical memory, 
so that the user-level server process never touches 
the actual DSM data. Without these calls, send- 
ing a page of DSM data to a remote host would be a 
much more costly operation involving the copying of 
data from the kernel to the server process’ address 
space, then copying the data back into the kernel 
for transmission over the network, followed by the 
reverse sequence at the destination host. 

Note that although the system call interface pro- 
vides several logically separate system calls, in fact 
the implementation uses only one actual system en- 
try, dsmsys(), which dispatches on its first argument 
to invoke the appropriate function. This scheme 
saves system call numbers and is similar to the sys- 
tem call interface of System V shared memory. 
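
The single-entry scheme can be pictured as a table lookup keyed by the first argument. The sketch below is not our kernel code: the operation codes, the two-argument handler signature, and the stub handlers are all hypothetical, chosen only to show the dispatch pattern.

```c
/* Operation codes are hypothetical; the paper does not list them. */
enum dsm_op { DSM_OP_CREATE, DSM_OP_DETACH, DSM_OP_WAIT, DSM_OP_COUNT };

/* Simplified handler type: every operation takes two long arguments. */
typedef long (*dsm_handler)(long a1, long a2);

/* Stub handlers standing in for the real kernel routines. */
static long stub_create(long size, long unused)
{
    (void)unused;
    return size > 0 ? 100 : -1;   /* 100 is a made-up object UID */
}

static long stub_detach(long objid, long unused)
{
    (void)objid;
    (void)unused;
    return 0;
}

static long stub_wait(long objid, long offset)
{
    (void)objid;
    return offset;                /* echo the offset so dispatch is visible */
}

static const dsm_handler dsm_table[DSM_OP_COUNT] = {
    [DSM_OP_CREATE] = stub_create,
    [DSM_OP_DETACH] = stub_detach,
    [DSM_OP_WAIT]   = stub_wait,
};

/* Single entry point: dispatch on the first argument. */
long dsmsys_sketch(int op, long a1, long a2)
{
    if (op < 0 || op >= DSM_OP_COUNT)
        return -1;
    return dsm_table[op](a1, a2);
}
```

A user-level library would then wrap each logical call, for example defining dsmwait(objid, offset) as a call to the single entry with the appropriate operation code.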

In the first version of our system, the DSM server 
executed completely in the kernel, similar to what 
occurs in the NFS network file system. This was 
done for efficiency reasons, and because at the out- 
set we did not have a clear picture of what sort 
of kernel interface would be required for a user-level
server. Unfortunately, the complexity of the
server data structures and storage management issues
was such that it became too difficult to completely
debug a kernel-mode server. In fact, one of
the most difficult aspects of the server implementa- 
tion was implementing a suitable reference counting 
scheme for DSM data structures, so that DSM re- 
sources would be reclaimed automatically when no 
processes were using them any more. To aid in de- 
bugging, we decided to reimplement the server as 
a user-level process, which communicates with the 
kernel through the narrow system call interface just
described. This interface is largely independent of 
the details of the consistency protocol, a feature we 
have found very useful while refining and debugging 
our particular protocol. 


3 Details of the Kernel DSM Subsys- 
tem 


An important design goal for our DSM facility 
was to have it mesh nicely with the structure of the 
FreeBSD VM system, and to require minimal modifi- 
cations to existing code. We feel we were reasonably 
successful at meeting this goal; in the rest of this 
section we describe in more detail some of the more 
interesting aspects of the design. 

The FreeBSD virtual memory system is based on 
that of 4.4BSD [MBKQ96], which in turn is derived 
from that of Mach (Figure 3). A fundamental con- 
cept in the Mach VM system [Tev87] is the concept 
of a VM object, which consists essentially of a place- 
holder for a sequence of physical pages, together with 
an associated pager, which is a set of functions that 
can be invoked to retrieve data from a backing store, 
Figure 3: Simplified view of the Mach VM architecture.


such as a swap area, a file, or a hardware device. A 
process obtains access to the data in a VM object 
by mapping some or all of its pages into its address
space. Typically a process has only a few mappings 
of VM objects at a time, but actually there is no 
limit on the number of mappings or the number of 
VM objects whose pages are mapped. 


In the FreeBSD VM system, the allocation of 
physical memory pages for a VM object is decou- 
pled from the mapping of the object into a process’ 
address space. Physical pages only need to be allo- 
cated in a VM object when an attempt is actually 
made by a process to access data at an offset in a 
VM object for which no page has yet been allocated. 
Such an attempt produces a page fault, and the page 
fault handler not only allocates a physical page for 
that data, but also invokes the pager to retrieve the 
data from backing store. The stock FreeBSD VM 
system supports three different types of pagers: a 
swap pager, which manages the swap area, a vnode
pager, which is used to page data to and from disk 
files, and a device pager, which is used to page data 
directly to a hardware device. Additional types of 
pagers are easily defined. 

Our DSM facility was designed to take advantage 
of the existing FreeBSD VM subsystem. Each DSM 
object contains an underlying VM object. Mapping 
a DSM object into a process’ address space amounts 
to simply mapping the underlying VM object. To 
support the fetching of data over the network, we 
introduced a new type of pager, called a DSM pager. 
When a fault occurs on a memory address that is 
mapped to a page in a DSM object, the page fault 
handler invokes the DSM pager. The DSM pager 
determines the status of that page by checking data 


structures maintained by the DSM subsystem, and 
then requests the local DSM server to perform any 
communication or synchronization required to bring 
a copy of the data into local physical memory, and
to write enable it, if necessary. 

At the rather coarse level of detail of the descrip- 
tion so far, the interaction of the DSM facility with 
the page fault handler seems quite simple. However,
there are some technical issues that make things a 
bit more complex than they seem at first. First of 
all, the stock FreeBSD page fault handler is an ex- 
tremely complex routine involving many subtle syn- 
chronization issues. In order to avoid a difficult de- 
bugging task, and to make it easier to track future 
releases of FreeBSD, we wanted to modify as little 
of the page fault handler as possible. Some modi- 
fication to the page fault handler was necessary, be- 
cause whereas the DSM pager needed to be informed 
as to whether a read fault or write fault was being 
handled, the pager interface in FreeBSD did not con- 
tain any provision for passing this information to the 
pager from the page fault handler. 

A second technical issue was that the DSM sub- 
system had to be responsive to requests from the 
pageout daemon to clean pages of physical memory. 
The obvious thing for the DSM pager to do when 
asked to clean a page would be to write it to the 
swap area. However, I/O to the swap area has to 
be asynchronous to avoid blocking the pageout dae- 
mon, and since there was already an existing swap 
pager that contained the complicated code neces- 
sary to perform this asynchronous I/O, we wanted 
to make use of it if possible. 

A third issue was how the user-level DSM server 
could arrange for DSM data stored at the local host 
to be transmitted over the network, even if this data 
happens to currently reside in the swap area. 

To understand how we dealt with the above tech- 
nical issues, it is necessary for us to describe one 
more feature of the FreeBSD/Mach VM system. In
the FreeBSD VM system, VM objects can be linked 
together in so-called “shadow chains,” which have a 
special significance to the page fault handler. When 
a page fault occurs for a page mapped to the first 
object in a chain, the page fault handler first con- 
sults the associated pager (if any) for that object, 
to try to handle the fault. If the pager for the first 
object fails to handle the fault, then the page fault 
handler tries the second object in the chain, and 
so on. Thus, each object in such a chain serves as 
a “backing object” for the preceding object, in the 
sense that if a page is not found in the preceding 
object, an attempt is made to obtain the page from 
the next object. In the original literature [Tev87] 
describing this scheme, each object in a chain is said
to be a “shadow” of the next object. However, this 
terminology has turned out to be very confusing, so 
we prefer to use the “backing object” terminology 
instead. 

A major purpose of the chains of VM objects 
in FreeBSD is to support “copy-on-write” for effi- 
cient forking. However, for the DSM facility we use 
these chains in a different way (Figure 4), which we 
now describe. We have already mentioned that each 
DSM object has an underlying VM object, which is 
the actual target of mapping operations by a pro- 
cess. This underlying object has an associated DSM 
pager, which encapsulates knowledge of how to ob- 
tain pages over the network and how to synchronize 
with the DSM subsystem at remote hosts. In addi- 
tion, when a DSM object is created, we also create a 
second VM object that serves as a backing object for 
the first, so that underlying a DSM object is always 
a chain of two VM objects. The pager associated 
with the backing object is not a DSM pager, but 
rather a swap pager. 

When a page fault occurs for an address mapped 
into a DSM object, the normal operation of the page 
fault handler is to check the first of the two under- 
lying VM objects to try to obtain the page. When 
the DSM pager associated with the first object is in- 
voked, it determines: (1) that no copy of the page is 
available at the local host, or (2) that a copy of the 
page is locally available, but it might be paged out 
locally to the swap area. In case (1), the DSM pager 
sends a datagram to the local DSM server request- 
ing the transfer of a copy of the page to the local 
host, and it sleeps awaiting the arrival of the page. 
In case (2), the DSM pager returns a failure indica- 
tion to the page fault handler, which then moves to 
the backing object. The page fault handler either 
finds the requested data already in physical mem- 
ory mapped from the backing object, or else invokes 
the swap pager associated with the backing object 
to bring the data in from the swap area. 
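
The fall-through behavior of the two-object chain can be sketched as follows. This is a deliberately simplified model, not the FreeBSD fault handler: the structure and function names are hypothetical, each object tracks a single page, the check for pages already resident in the backing object is omitted, and the network fetch and swap read are reduced to stubs.

```c
#include <stddef.h>

enum pager_result { PAGER_OK, PAGER_FAIL };

struct vm_object_sketch;
typedef enum pager_result (*pager_fn)(struct vm_object_sketch *obj, int page);

/* Hypothetical, heavily reduced VM object. */
struct vm_object_sketch {
    pager_fn pager;                     /* may be NULL */
    struct vm_object_sketch *backing;   /* next object in the chain */
    int local_copy;   /* DSM object: 1 if this host holds a copy */
    int fetched;      /* DSM object: requests sent to the DSM server */
};

/* DSM pager. Case (1): no local copy, so ask the local DSM server to
 * fetch one. Case (2): a local copy exists but may be paged out, so
 * report failure and let the handler try the backing object. */
static enum pager_result dsm_pager(struct vm_object_sketch *obj, int page)
{
    (void)page;
    if (!obj->local_copy) {
        obj->fetched++;        /* stands in for "send datagram and sleep" */
        obj->local_copy = 1;
        return PAGER_OK;
    }
    return PAGER_FAIL;
}

/* Swap pager stub: stands in for reading the page from the swap area. */
static enum pager_result swap_pager(struct vm_object_sketch *obj, int page)
{
    (void)obj;
    (void)page;
    return PAGER_OK;
}

/* Walk the chain; return the object that satisfied the fault, or NULL
 * if no object in the chain could. */
const struct vm_object_sketch *fault_sketch(struct vm_object_sketch *first,
                                            int page)
{
    for (struct vm_object_sketch *o = first; o != NULL; o = o->backing) {
        if (o->pager != NULL && o->pager(o, page) == PAGER_OK)
            return o;
    }
    return NULL;
}
```

In this model the first fault on a page is satisfied by the DSM pager (a network fetch), while a later fault on the same page falls through to the backing object and its swap pager.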

The utility of the two-element chain underlying 
a DSM object becomes evident when one considers 
how to implement the “pageout” operation of the 
DSM pager. Rather than having the DSM pager per- 
form a complicated algorithm for asynchronous I/O 
to the swap area, in response to a “clean” request 
from the pageout daemon, the DSM pager simply 
copies the data from the first object in the chain 
to the corresponding position in the second object. 
This doesn’t immediately free up physical memory, 
but if there is a high demand for memory, the page- 
out daemon will eventually ask the swap pager asso- 
ciated with the second object in the chain to clean 
its page, in which case the data will be written to
the swap area using the standard pageout code. 

Thus, the two-element chain of VM objects un- 
derlying each DSM object permits the DSM system 
easy access to the normal swap area, with hardly 
any additional code required. This organization also 
pays off when the DSM server wishes to transmit to 
a remote host DSM data that has been paged out 
locally. As discussed above, transmission of DSM 
data is accomplished by the special dsmsendpage 
server system call, to minimize the number of copies 
of a page’s data when sending that page to a remote 
server. This system call simply maps the page of 
the first object in the two-object chain into the ker- 
nel address space, and then copies data from that 
page to the network subsystem. If the page hap- 
pens not to be resident in physical memory, a page 
fault will occur, and since the DSM pager does not 
have the page, the page fault handler will follow the 
object chain and find the page in the backing object. 

Similarly, the reception of DSM data makes use 
of the special dsmrecvpage server system call, to 
minimize the number of copies of DSM data during 
reception. This system call temporarily maps the 
physical page, which was allocated when the page 
fault first occurred, into the kernel address space 
and copies the data from the network subsystem to 
that page. 

To support the scheme described above, the ker- 
nel needs to have a certain amount of information 
about the state of the DSM subsystem. In particu- 
lar, the kernel keeps a data structure for each DSM 
object that points to the associated VM objects, and 
keeps track of the location and status (available locally/available
remotely, resident/nonresident, write
enabled/write protected) of each page in the object. 
It also keeps a list of pending DSM requests from 
clients, so that the proper client process can be iden- 
tified and awakened when the DSM server responds 
to such a request. The kernel does not need to know 
anything about the particulars of the DSM protocol 
or about remote sites participating in the DSM pro- 
tocol; this is entirely the responsibility of the user- 
level DSM server process. 


4 DSM Protocol 


This section describes the DSM protocol executed 
by our user-level DSM server processes. Essentially, 
there are two related protocols: (1) a membership 
protocol, which keeps track of hosts that are cur- 
rently interested in accessing a DSM object, and (2) 
the consistency protocol, which is executed by hosts 
wishing to read or write pages in DSM objects. The 
consistency protocol is executed frequently: every 
Figure 4: Use of object shadow chains by the DSM facility.


time a host requires an up-to-date copy of a DSM 
page or needs to obtain write permission on a page. 
The membership protocol is executed less frequently, 
but it must be reliable in the sense that the correct- 
ness of the consistency protocol depends on the accu- 
racy of the information maintained by the member- 
ship protocol. For efficiency, the frequently executed 
consistency protocol uses unreliable datagrams in all 
situations but one. On the other hand, the less- 
frequently executed membership protocol uses re- 
liable datagrams for all communications, where by 
reliable we mean that a timeout/retransmit scheme 
is used to guarantee delivery, and ordering of data- 
grams between each pair of hosts is maintained using 
a sequence numbering scheme. 

To explain the protocol, we first need to introduce 
some preliminary concepts and terminology. When a 
DSM object is first created, the only host that knows 
about that object is the host at which the object is 
created. This host is called the object manager, and 
it plays a special role in some aspects of the DSM 
protocol. Application processes at other hosts be- 
come informed about the existence of a DSM object 
by receiving its UID via some form of communica- 
tion outside the DSM system. The first time an 
application process at a host tries to attach to the 
DSM object, the DSM server at that host applies to 
the manager of the object for membership in the ob- 
ject; that is, it asks to be added to the list of all hosts 
that are currently interested in that object. When 
the manager grants membership to a new member, 
it informs all previous members about the new mem- 
ber, and it informs the new member of the current 


membership list, so that each member of a DSM ob- 
ject knows at all times who all the other members 
are. Members of an object can resign their member- 
ship at any time; in this case a message is sent to 
the manager, who informs the remaining members 
about the change. 
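
The manager's bookkeeping for joins and resignations can be sketched as operations on a small list. The names, the fixed capacity, and the use of integers as host identifiers are assumptions made for the illustration; the reliable notifications that the manager sends are represented only by comments.

```c
#include <stdbool.h>

#define DSM_MAX_MEMBERS 16   /* arbitrary capacity for the sketch */

/* Hypothetical membership list kept by the object manager. */
struct dsm_membership {
    int hosts[DSM_MAX_MEMBERS];
    int count;
};

bool dsm_is_member(const struct dsm_membership *m, int host)
{
    for (int i = 0; i < m->count; i++)
        if (m->hosts[i] == host)
            return true;
    return false;
}

/* Grant membership. Returns 0 on success (or if the host is already a
 * member), -1 if the list is full. */
int dsm_manager_grant(struct dsm_membership *m, int host)
{
    if (dsm_is_member(m, host))
        return 0;
    if (m->count == DSM_MAX_MEMBERS)
        return -1;
    /* Here the manager would inform all previous members about the
     * newcomer and send the newcomer the current membership list. */
    m->hosts[m->count++] = host;
    return 0;
}

/* Process a resignation. Returns 0 on success, -1 if not a member. */
int dsm_manager_resign(struct dsm_membership *m, int host)
{
    for (int i = 0; i < m->count; i++) {
        if (m->hosts[i] == host) {
            m->hosts[i] = m->hosts[m->count - 1];
            m->count--;
            /* Here the manager would inform the remaining members. */
            return 0;
        }
    }
    return -1;
}
```

Routing every change through the manager is what lets each member keep a complete and current copy of the list.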

The consistency protocol belongs to the class of 
protocols that Li [LH89] calls “dynamic distributed 
manager algorithms with page invalidation.” Just as 
each DSM object has a manager, each page within 
a DSM object has an owner. However, unlike the 
object manager, which is fixed at the time the object 
is created and never changes, the owner of a page 
changes during execution. Initially, the manager of 
an object owns all pages in the object. Each time a 
host receives write permission on a page in an object,
it becomes the owner of that page. The owner of a 
page has the responsibility of safeguarding the data 
in a page, until it has determined which host will be 
the next owner and has successfully transferred the 
data to that host. 

Each host that is a member of a DSM object 
maintains the following state information for the ob- 
ject as a whole: (1) the number of local clients at- 
tached to this object; (2) the identity of the object 
manager; (3) the object UID; (4) the current mem- 
bership list for the object; (5) the size of the object, 
in bytes and pages. 

In addition, for each page in the object, the following
state is maintained: (1) an “owner hint” indicating
who the current owner of that page might
be; (2) a “version hint” indicating the current version
number of the page; (3) a “copies hint” indicating
how many copies there might be of the page;
(4) a flag indicating whether this host has a copy of
the page; (5) a flag indicating whether this host has
write permission on the page.

The purpose of the owner hint for a page is to 
try to route requests for a page quickly to the DSM 
server that has the most recent data for that page. 
This hint may become stale, but the protocol guar- 
antees that the host mentioned in the hint is always 
closer to the actual owner than the host holding the 
hint. Furthermore, the owner hint is always accu- 
rate if the host currently has a copy of the page. 
The version hint is used to filter out datagrams re- 
ceived out of order, which, if processed, might lead 
the system to an inconsistent state. The copies hint 
is an estimate of the number of copies of the page 
that exist. This estimate is conservative in the sense 
that the estimate held by the owner of a page is al- 
ways at least as large as the actual number of copies 
in existence. 

There are two basic operations of the DSM protocol:
READ (reading a page) and WRITE (writing
a page). We now discuss these operations in some
detail.


4.1 Reading a Page 


Figures 5a) to 5c) show three interesting cases of a 
READ operation. Figure 5a) shows a simple READ, 
in which the host wishing to obtain a copy of a page 
has an accurate owner hint, and no messages are 
lost. In this case, only two messages are required: 
a READ message from the requesting host to the 
owner, and a DATA reply from the owner. This is 
the situation we expect to occur most frequently in 
actual execution. Note that because all messages 
exchanged in the consistency protocol use unreliable 
datagrams, there are no hidden acknowledgments or 
other messages, and so exactly two messages are re- 
quired to read a page in this situation. 

A slightly more complicated case is when the 
owner hint held by the requesting host is stale. In 
this case, the host that receives the READ message 
uses its own hint to forward the message toward the 
actual owner, as depicted in Figure 5b). The DATA 
reply goes directly to the requesting host, who up- 
dates its owner hint upon receipt. Since owner hints 
are updated whenever a host receives new informa- 
tion about the owner of a page, we expect that 
READ messages will generally be forwarded only a 
few hops. 
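
The forwarding of a READ along stale owner hints can be sketched as a walk over an array of hints. This is an illustrative model only, not the server code: the function route_read and the convention that the actual owner's hint points at itself are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* owner_hint[h] is the host that host h believes owns the page.
 * Assumption for this model: the actual owner points at itself.
 * Returns the owner reached, or -1 if the hints form a cycle
 * (which the protocol is designed to rule out). On success,
 * *forwards receives the number of times the READ was forwarded. */
int route_read(const int *owner_hint, size_t n_hosts, int requester,
               int *forwards)
{
    assert(owner_hint != NULL);
    assert(requester >= 0 && (size_t)requester < n_hosts);

    int at = owner_hint[requester]; /* first READ goes to the hinted host */
    int fwd = 0;
    for (size_t step = 0; step < n_hosts; step++) {
        assert(at >= 0 && (size_t)at < n_hosts);
        if (owner_hint[at] == at) { /* reached the actual owner */
            if (forwards != NULL)
                *forwards = fwd;
            return at;
        }
        at = owner_hint[at];        /* forward using that host's own hint */
        fwd++;
    }
    return -1;
}
```

With hints {1, 2, 2, 0}, host 2 is the owner: a READ from host 1 reaches it directly, while a READ from host 0 is forwarded once. After the DATA reply arrives, the requester would overwrite its own hint with the owner's identity, so a later READ needs no forwarding.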

Figure 5c) shows a scenario in which a DATA message
is lost. In this case, a timeout at the requesting
host (by a process sleeping in the DSM pager code)
triggers the retransmission of the READ message.
This mechanism may lead to the reception of multiple
DATA messages containing the same data. To
detect this situation, DATA messages include the
current version number of the page, and a host receiving
a DATA message discards the message if the
version is either older than the most recent version
of which the host is aware, or the same as the version
of any currently held copy. To provide quick error
recovery, but to avoid flooding the system with retransmitted
READ messages in case of heavy load,
we use an exponential backoff scheme with an upper
bound to increase the timeout value in case
the READ message has to be retransmitted several
times.

4.2 Writing a Page 

We use a write-invalidate strategy to ensure se- 
quential consistency. That is, before the DSM server 
at a host allows a client process to modify a page, 
it invalidates all copies of that page that exist at 
remote hosts. 

In order for a host to initiate a WRITE opera- 
tion, it is first required to have a copy of the cur- 
rent version of the page. If it wishes to perform a 
WRITE, but it does not have a current copy, it first 
executes a READ operation to obtain a copy as de- 
scribed above. There are two reasons for requiring 
a host wishing to write to have a current copy: (1) 
it ensures that the host knows the current owner of 
the page, and (2) it ensures that subsequent message 
loss during the WRITE operation cannot cause the 
loss of the data in the page. 

Our protocol for writing a page has some features
not usually found in protocols of this
type. First, the owner of a page keeps track only of
the number of copies of that page, rather than the
actual identities of the hosts that have those copies.
Second, in all the similar protocols that we
know of, the write part of the protocol has two distinct
operations: the ownership transfer operation
and the remote copies invalidation operation. In our
protocol, the transfer of ownership and the invalidation
are combined into a single operation.

Figure 5: Read operation of the DSM protocol.
(Panels a), b), and c); message labels: READ, READ(fwd), DATA.)

Figure 6a) shows what we expect to be the most
common case of the WRITE operation, in which
there are only a few copies of the page, the owner’s
copies hint is accurate, and only one host is attempting
to write the page. In this situation, the
host wishing to write multicasts a WRITE message
to every host in the current membership list
for the DSM object containing the page to be written.
When a member receives the WRITE message,
it updates its owner hint to point to the host
that issued the WRITE message, and then, if and
only if it has a copy of the page, it invalidates that
copy and responds with a WRITEOK message. The
WRITEOK message sent by the current owner of
the page implicitly carries with it a transfer of ownership
of that page to the requesting host, who, upon
receipt of such a message, becomes the new owner,
and issues an acknowledgment to the previous owner
to release it from any further ownership responsibilities.
The WRITEOK message sent by the previous
owner also includes a copies hint, which is then used
by the new owner to determine when all the remote
copies of the page have been invalidated. Specifically,
the new owner knows that all copies have been
invalidated when the number of WRITEOK messages
it has received is equal to the copies hint it
received in the WRITEOK message from the previous
owner.

As it is absolutely essential that there be no ambi- 
guity about whether ownership has been transferred, 
the previous owner must wait for its WRITEOK 
message to be acknowledged, retransmitting the 
WRITEOK if necessary, before continuing with any 
other activity regarding this page. The WRITEOK 
from the previous owner to the new owner is the 
only reliable datagram used in the consistency pro- 
tocol. However, observe that the acknowledgement 
message used by the reliable datagram service is not 
in the critical path, as the new owner does not have 
to wait for the acknowledgement to reach the previ- 
ous owner. 

Figure 6b) illustrates what happens in the case of 
write contention; that is, when two or more hosts try 
to write the same DSM page at the same time. When 
the owner of the page processes the first WRITE 
message for that page, it invalidates its copy of the 
page and replies with a WRITEOK. If it later sees 
a WRITE message from some other host for the 
same version of the page, it simply discards the later 
WRITE message. This ensures that only one host
will become the owner of the next version of the 
page, and consequently at most one host will be 
granted the right to write that page. 

A host trying to write a page also discards any 
WRITE messages it receives. This is necessary, be- 
cause the only alternative would be for the host to 
invalidate its copy of the page, but then the page 
would be lost if the host should happen to be granted 
ownership by the previous owner. Thus, without any 
special provisions, a deadlock could result when two 
hosts try to write the same page and each stead- 
fastly refuses to invalidate and send a WRITEOK 
to the other. To handle this situation, a host starts 
a timer when it first multicasts a WRITE message. 
If, by the time the timer has expired, it has received 
ownership of the page but has not received enough 
WRITEOK replies, it multicasts a PURGE message 
to the membership list. Upon receiving a PURGE 
message, every member is obligated to invalidate any 
copy it holds, to update its owner hint to point to the 
sender, and to reply to the sender with a WRITEOK 
message. If insufficient WRITEOK messages are re- 
ceived after a suitable period, the owner of the page 
multicasts another PURGE message. This scenario 
repeats until the owner is sure that all copies have
been invalidated. 

To be sure that all copies of a page are actually
invalidated, the host issuing a PURGE message has
to be able to distinguish WRITEOK messages sent
in response to the initial WRITE message from those
sent in response to subsequent PURGE messages. For
this purpose, the host maintains a “phase counter,”
which is an integer variable that is incremented every
time the server multicasts a new set of WRITE
or PURGE messages. Each such message includes
the value of the phase counter, which the recipient
echoes back in the WRITEOK response.

Figure 6: Write operation of the DSM protocol.


Another requirement for correctness of the proto- 
col is that it should not lead to cycles in a page’s 
owner chain. To satisfy that requirement, once a 
PURGE message has been processed for a version of 
a page, a host must not again update its owner hint 
for that page in response to a WRITE message for 
the same version of that page. Actually, our proto- 
col uses the following stronger policy concerning the 
updates of owner hints: after processing a WRITE 
or PURGE message for a version of a page, a host 
will not again update its owner hint for that page in 
response to a WRITE message for the same version 
of that page. However, even if a host has previously 
received a WRITE message for a version of a page, 
it will still update its owner hint for that page in 
response to a PURGE message for a version of that 
page, as long as it is not aware of a more recent ver- 
sion of the page than that specified in the PURGE 
message. 


As mentioned previously, the copies hint held by 
the owner might not be accurate. For example, due 
to a slow network or a slow server, a host might 
retransmit a READ message before it receives the 
DATA response sent by the owner of the page in 
response to the original READ message. Since the 
owner does not keep track of the identity of hosts 
to which it sent copies of the page, it has no choice 
upon receiving a retransmitted READ message but 
to send another DATA message and increment its 
copies hint to maintain a conservative estimate. This 
leads to a possibility that a subsequent WRITE op- 
eration will deadlock, due to the fact that it will be 
impossible to obtain enough WRITEOK messages. 
Eventually, the host wanting to write will time out 
as described above. If the host has become the owner 
by the time the timeout occurs, it will multicast a 
PURGE message as already described. However,
there exists the possibility, due to a slow network 
or slow response from the previous owner, that by 
the time the timeout occurs, the host wanting to 
write has not yet become the owner. In this case, 
that host simply retransmits the WRITE message 
to the owner of the page, whose identity it knows 
because the fact that it has a copy of the page means 
that its owner hint is accurate. Figure 6c) illus- 
trates such a scenario, in which the previous owner 
has an inaccurate copies hint and the WRITEOK 
it sends is slow in arriving at the host performing 
the WRITE. When the retransmitted WRITE mes- 
sage arrives at the previous owner, it discards it be- 
cause the WRITEOK has already been sent. Even- 
tually, the host performing the WRITE will time 
out again. By this time, however, it has received 
the WRITEOK from the previous owner, and thus 
will proceed to a PURGE operation as above. 

In summary, the basic write protocol just de- 
scribed consists of a sequence of phases: an initial 
WRITE phase, then zero or more phases in which 
the WRITE message is retransmitted to the owner 
of the page, then zero or more PURGE phases. 
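
The sequence of phases reduces to a single decision taken each time the writer's timer expires. The function below is an illustrative summary with hypothetical names (write_action, on_write_timeout), not the server's actual control flow.

```c
#include <assert.h>
#include <stdbool.h>

enum write_action {
    ACT_DONE,                  /* every copy is known to be invalidated */
    ACT_RESEND_WRITE_TO_OWNER, /* ownership has not arrived yet */
    ACT_MULTICAST_PURGE        /* owner, but too few WRITEOK replies */
};

/* Decide what to do when the timer for the current phase expires. */
enum write_action on_write_timeout(bool is_owner, bool all_invalidated)
{
    /* Completion is only meaningful once ownership has been received. */
    assert(is_owner || !all_invalidated);
    if (!is_owner)
        return ACT_RESEND_WRITE_TO_OWNER;
    return all_invalidated ? ACT_DONE : ACT_MULTICAST_PURGE;
}
```

Since a host never gives ownership back while the operation is in progress, the decision can only move from retransmitted WRITE phases to PURGE phases, matching the order given in the summary.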

Message loss has basically the same effect for 
write operations as does uncertainty about the num- 
ber of copies of a page: the server does not receive 
enough WRITEOK messages and consequently can- 
not be sure whether its copy of the page is the only 
one in the system. Thus, the mechanisms used to 
handle inaccuracy of the copies hint also handle the
loss of messages. 

In order to reduce the write time, we use one op- 
timization that is worth mentioning. As shown in 
Figure 6c), a very slow network or host can trig- 
ger a new WRITE/PURGE phase, due to the late 
arrival of a WRITEOK message at the host performing
the WRITE/PURGE operation. If the situation
persists, the host performing the WRITE/PURGE 
operation will never receive enough WRITEOK mes- 
sages for a phase before its timer for that phase ex- 
pires. To prevent such a scenario, a host performing 
a WRITE operation keeps track of the WRITEOK 
messages received for a fixed number of previous 
WRITE/PURGE phases, and it accepts WRITEOK 
messages sent in the scope of any such phase, even 
though it might have initiated several new phases. 


5 Experimental Results and Discussion

We have run reasonably rigorous tests of the con- 
sistency protocol using some simple exerciser pro- 
grams, and we have measured some basic perfor- 
mance parameters. In this section, we describe the 
results of these tests. An important test that we have
not yet carried out is to use the system for a realistic
application.

In order to evaluate the performance of our im- 
plementation, we performed two sets of experiments: 
one to determine the basic costs of handling read and 
write faults on DSM pages, and another to assess 
the scalability of our protocol. In these experiments 
we used PCs with either 75 MHz or 100 MHz
Pentium microprocessors, each with a 256 Kbyte
“write back” second level cache and 16 Mbyte of
main memory. These machines are interconnected
by a 100 Mbps Ethernet via an SMC EtherPower
10/100 adapter, which sits on the PCI bus and supports
DMA.

To determine the costs of both read and write 
faults, we ran a simple “ping-pong” exerciser pro- 
gram in which two clients running on different hosts 
alternately write a DSM page in such a way that 
the page bounces back-and-forth between the two, 
and there is no write contention. Table 1 shows 
both the minimum values and the average values 
over 100,000 operations, for both the read and write 
times, measured at the kernel level on machines with 
75 MHz processors. The read times give the actual 
time taken to handle a read fault by a client pro- 
cess. The write times include only the time taken to 
execute the invalidation protocol — in general, han- 
dling a write fault may also require an initial read 
operation to obtain a local copy of the page before 
beginning invalidation. 

These results are encouraging: they are within 1.5
ms of the best published values that we know of
[BB93], which were measured on a kernel implementation
using a network subsystem specially designed
for performance that accesses the network interface
directly, bypassing the UDP and other communications
software layers. Note, however, that the values
presented in that paper were measured using IBM
RISC System/6000 Model 530 machines, which run at a
clock frequency of 25 MHz, interconnected by a
point-to-point 220 Mbps optical fiber network.

It is worth mentioning that a “ping” (ICMP echo 
request/response) between the machines we used in 
our experiments takes on average 0.44 ms, and the 
messages exchanged in a ping neither traverse the 
UDP layer nor are they processed at user level. We 
believe that improved performance would result due 
to reduced IPC costs, if the DSM server were moved 
into the kernel. 

Table 2 shows the breakdown of both the read
and write fault times (again, these are averages
over 100,000 operations). The IPC times, RD_IPC
and WR_IPC, measure the time from the moment
the DSM pager sends a request to the local DSM
server to the time that server starts processing that
request. Thus they include the time to send a message
to the local host, the time needed for a context
switch, the time to receive the message from a
socket, the time spent by that message in a queue of
messages in the DSM server, and the time required
to do some preprocessing of the message. For this
experiment, the message queue at the DSM server
is empty, so the message is processed immediately.
The DSM server times, RD_SRV and WR_SRV, are
the times taken by the server to satisfy the DSM
pager request; that is, the times taken to get a page
from a remote server or to invalidate copies of the
page in remote servers.

To assess the scalability of our algorithms we measured
the time to invalidate the remote copies of a
page under conditions of no contention and when 
the number of copies of the page is equal to the 
number of members of the object. As one would 
hope, the invalidation time depends roughly linearly 
on the number of remote copies: our measurements 
showed a constant overhead of 1.3ms, plus an ad- 
ditional 0.4ms for each copy to be invalidated, over 
the range of 2 to 11 remote copies. Note that our 
current implementation sends out WRITE messages 
sequentially. We expect that using a multicast facil- 
ity for this would decrease the per-copy overhead. 
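
Written as a formula, the measured behavior is a simple linear model. The symbol T_inv and the use of n for the number of remote copies are notation introduced here for illustration; the coefficients are the ones reported above and apply only over the measured range.

```latex
T_{\mathrm{inv}}(n) \approx 1.3\,\mathrm{ms} + 0.4\,\mathrm{ms} \cdot n,
\qquad 2 \le n \le 11
```

For example, the model gives about 1.3 + 0.4 x 2 = 2.1 ms for 2 remote copies and about 1.3 + 0.4 x 11 = 5.7 ms for 11 remote copies.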

Table 1: Basic read and write times measured at kernel level.
(Rows: minimum time (ms); average time over 100,000 values (ms).)

| | RD_IPC | RD_SRV | WR_IPC | WR_SRV |
| average time (ms) | 0.58 | 2.1 | 0.57 | 1.4 |

Table 2: Breakdown of the read and write times shown in Table 1.

We also ran some experiments under conditions of
high write contention. These experiments revealed
two potential problems with our current protocol.
First, the slower machines (the 75 MHz processors)
tended to starve in favor of the faster machines (the
100 MHz processors). The second problem, which
was exacerbated by the first, is that the simple
timeout/retransmit policy with exponential backoff we
currently use to handle message loss and deadlock
situations did not adapt well to varying loads, resulting
in a large number of retransmitted messages
under conditions of high contention. We are considering
ways of improving the timeout heuristics, and
of modifying the protocol to alleviate the starvation
problem.

A basic aspect of our design that we did not eval- 
uate experimentally was our decision to use UDP 
rather than TCP for the consistency protocol. We 
continue to feel that any simplifications in the protocol
that might be afforded by the use of TCP as an
underlying reliable communications protocol would 
be more than offset by the overhead of additional 
acknowledgements, the loss of control over the re- 
transmission policy, and the need for an additional 
software layer to re-implement a message-based com- 
munication model on top of the stream-based TCP 
protocol. In addition, the use of TCP would not 
allow us to take advantage of multicast support pro- 
vided by IP. In spite of the above, to validate our be- 
lief in the superiority of datagrams over streams as 
an underlying protocol, it would probably be worth- 
while to perform some experiments in which we com- 
pare the performance of the UDP-based version of 
our protocol with a reasonably similar TCP-based 
version. 

Another interesting question concerns the impact 
on paging performance of our scheme for “pageout” 
of DSM pages by copying the data to the second 
object in the shadow chain. It is possible that the 
“second chance” this scheme gives to pages containing
DSM data could have unforeseen interactions with
the pre-existing page replacement policy. To exam- 
ine these questions, we would have to test our system 
with a realistic application, under conditions that 
would cause heavy pageout to the disk. We have 
not yet performed such tests. 


6 Related Work 


Research in the area of DSM systems has been 
very intense and there is an extensive literature 
[Esk96]. We compare our system to other software
DSM implementations that support a sequential
consistency model. The main points that we
feel distinguish our facility are: (1) The consistency 
protocol is a lightweight, distributed protocol, which 
uses unreliable datagrams, but which is robust with 
respect to message loss, reordering, or duplication. 
(2) The facility is for a version of Unix (FreeBSD 
2.1) for which source code is readily available and 
which runs on commodity hardware. (3) The clean 
interface between the user-level server and the ker- 
nel should facilitate experimentation with a variety 
of DSM protocols. 

Most of the first implementations of software 
DSM systems, including that of IVY [Li88], the DSM 
system for Clouds [RK89] and Mirage [FP89], were
implemented in operating systems different from 
Unix and the consistency protocols used assumed 
reliable communications. Mether [MF89] is the ex- 
ception among early implementations. It is a kernel 
level implementation of DSM for SunOS 4.0. Although
it uses UDP, it relies on hardware support for
error correction. A more recent implementation of 
Mirage [FHJ94], although for the AIX operating sys- 
tem, uses reliable communication services. Further- 
more, every page request has to be sent to the page’s 
manager, which sends it to the current owner of the 
page. 

The DSM systems described in [FBS89] and 
[AAO92] take advantage of the VM external pager 
interface provided by Mach and CHORUS micro- 
kernels, respectively. The consistency protocol of 
Mach’s DSM uses only a point-to-point reliable com- 
munication service, in contrast to ours which uses 
multicast and unreliable communication services. 
Chorus’s DSM uses one of Li’s dynamic distributed
manager algorithms with page invalidation [LH89],
but the authors do not specify which, and their description
of the protocol is rather incomplete. In
addition, they do not provide details with respect to 
the kind of communication services used for IPC. As 
does our DSM system, Chorus’ DSM supports pag- 
ing out pages to disk, but, in contrast to our system, 
paging out is handled by the object manager. 


Both DSVM6K [BB93], a DSM system developed
for AIX v3, and the DSM system developed for the
TOPSY multicomputer [SWS92] have an architecture
very similar to that of our system. However,
the latter was designed for a distributed memory
multiprocessor system using a multiprocessor operating
system, and DSVM6K assumes that the communication
system provides reliable communication,
i.e., in-order delivery, no message loss and no data
corruption.


7 Conclusion and Future Work 

In this paper we described a DSM facility, 
supporting a sequential consistency model, for 
FreeBSD, a freely and widely available version of 
Unix. We believe that this facility meets most of 
our design goals. The consistency protocol is a 
lightweight protocol that uses only UDP/IP, but is 
nevertheless tolerant to both message reordering and 
message loss. We were able to define a very sim- 
ple client application interface based on the Unix 
mmap() interface. One of the most successful as- 
pects of our design is its smooth integration into the 
VM subsystem of FreeBSD, which required very lit- 
tle in the way of modifications to existing code. We 
believe that it should be possible, with minimal ef- 
fort, to port this code to other Unix systems, such 
as OSF/1, with Mach-based VM subsystems. 

Besides improving the performance of our system 
in ways that have already been discussed, we are 
interested in using the facility for real applications. 
We are especially interested in the idea of using DSM 
as a tool for programming distributed applications, 
rather than for concurrent computation, which has 
been the focus of most DSM research. 


8 System Availability 

We are making our code available to anyone in- 
terested under a Berkeley-style copyright and li- 
cense. The code may be obtained via the URL: 
http://www.cs.sunysb.edu/~stark/, or by mailing
to one of the authors.


9 Acknowledgements 

We wish to thank Professor Tzi-cker Chiueh for 
making his laboratory facilities available to us, as 
well as the other members of the Experimental Com- 
puter Systems Laboratory for their generous coop- 
eration in the sharing of these facilities. 


References 

[AAO92] V. Abrosimov, F. Armand, and M. I. Ortega.
        A Distributed Consistency Server for the CHORUS
        System. In Proc. of the Symposium on Experiences
        with Distributed and Multiprocessor Systems
        (SEDMS III), pages 129-148. USENIX, March 1992.

[ATT90] ATT. UNIX SYSTEM V Release 4 - Programmers
        Guide: System Services and Application Packaging
        Tools. Unix Press, 1990.

[BB93] Marion L. Blount and Maria Butrico. DSVM6K:
        Distributed Shared Virtual Memory on the RISC
        System/6000. In Proc. of the 38th IEEE International
        Computer Conference (COMPCON Spring 93),
        pages 491-500. IEEE, February 1993.

[Esk96] M. Rasit Eskicioglu. A Comprehensive Bibliography
        of Distributed Shared Memory. Operating Systems
        Review, 30(1):71-96, January 1996.

[FBS89] A. Forin, J. Barrera, and R. Sanzi. The Shared
        Memory Server. In Proc. of the Winter 1989 USENIX
        Conference, pages 229-243. USENIX, January 1989.

[FHJ94] B. D. Fleisch, R. L. Hyde, and N. C. Juul.
        MIRAGE+: A Kernel Implementation of Distributed
        Shared Memory on a Network of Personal Computers.
        Software - Practice and Experience,
        10(24):887-909, October 1994.

[FP89] B. D. Fleisch and G. J. Popek. Mirage: A Coherent
        Distributed Shared Memory Design. In Proc. of
        12th ACM Symposium on Operating Systems Principles
        (SOSP’89), pages 211-223, December 1989.

[Lam79] L. Lamport. How to make a multiprocessor computer
        that correctly executes multiprocess programs.
        IEEE Transactions on Computers, C28(9):690-691,
        September 1979.

[LH89] Kai Li and Paul Hudak. Memory Coherence in
        Shared Virtual Memory Systems. ACM Transactions
        on Computer Systems, 7(4):321-359, November 1989.

[Li88] Kai Li. IVY: A shared virtual memory system for
        parallel computing. In Proc. of the 1988 International
        Conference on Parallel Processing, pages 94-101,
        August 1988.

[MBKQ96] Marshall Kirk McKusick, Keith Bostic,
        Michael J. Karels, and John S. Quarterman. The
        Design and Implementation of the 4.4 BSD Operating
        System. Addison Wesley, 1996.

[MF89] Ronald G. Minnich and David J. Farber. The
        Mether System: Distributed Shared Memory for
        SunOS 4.0. In Proc. of the Summer 1989 USENIX
        Conference, pages 51-60. USENIX, June 1989.

[RK89] U. Ramachandran and M. Y. A. Khalidi. An
        Implementation of Distributed Shared Memory. In
        Proc. of the Workshop on Experiences with Distributed
        and Multiprocessor Systems, pages 21-38.
        USENIX, October 1989.

[SWS92] T. Stiemerling, T. Wilkinson, and A. Saulsbury.
        Implementing DVSM on the TOPSY Multicomputer.
        In Proc. of the Symposium on Experiences with
        Distributed and Multiprocessor Systems (SEDMS III),
        pages 263-279. USENIX, March 1992.

[Tev87] Avadis Tevanian. Architecture-Independent
        Virtual Memory Management for Parallel and
        Distributed Environments. PhD thesis, Department
        of Computer Science, Carnegie Mellon University,
        December 1987.


1997 Annual Technical Conference 


USENIX Association 




Cdt: A General and Efficient Container Data Type Library 


Kiem-Phong Vo 


AT&T Labs 
600 Mountain Ave 
Murray Hill, NJ 07974 
kpv@research.att.com


Abstract 


Cdt is a container data type library that 
provides a uniform set of operations to 
manage dictionaries based on the common 
storage methods: list, stack, queue, set and 
ordered set. It is implemented on top of 
linked lists, hash tables and splay trees. 
Applications can dynamically change both 
object description and storage methods so 
that abstract operations can be exactly 
matched with run-time requirements to op- 
timize performance. This paper briefly 
overviews Cdt and presents a performance
study comparing it to other popular con- 
tainer data type packages. 


1 Introduction 


A data structure to store objects is a container data 
type and an instance of it is a container or a dic- 
tionary. Common examples of container data types 
are: balanced and splay trees[1, 9, 10], skip lists[8], 
hash tables, stacks, queues and lists[2]. Each data
type has unique operational and performance prop- 
erties. For example, inserting an object into a stack 
is restricted to the stack top and takes constant time 
while inserting an object into a set of ordered ob- 
jects represented by a balanced tree takes logarith- 
mic time. 


Container data structures are pervasive in programs 
and there are many library packages to deal with 
them. Most Unix/C environments include the func- 
tions tsearch, hsearch, lsearch, and bsearch to ma- 
nipulate respectively objects stored in binary trees, 
hash tables, arrays and sorted arrays. C++ provides 
classes such as Map [3] and Set [5] to deal with ordered
maps and unordered sets. Recently, a new set
of C++ templates for ordered and unordered maps
and sets called the Standard Template Library [6] has
become increasingly popular. 


Two common problems with existing container data 
type packages are interface confusion and perfor- 
mance deficiency. At the interface level, other than 
similar names, the Unix/C search functions have lit- 
tle else in common in how they manage objects. This 
is disconcerting because all container data types sup- 
port the same basic set of abstract operations: in- 
sert, delete, search, and iterate. When such oper- 
ations are realized via distinct interfaces and con- 
flicting conventions, programmers have a hard time 
learning and programming them. The STL tem- 
plates alleviate this problem by establishing inter- 
face guidelines for similar operations in different con- 
tainer data types. However, it is difficult with the 
STL templates to manage objects in multiple con- 
texts because container and object types must be 
statically bound together to form dictionaries. Aside 
from interface issues, we shall see later that not only 
the older Unix/C packages but also the more modern 
C++ packages do not always perform well in both 
time and space usage. 


This paper introduces Cdt, a C library for managing
dictionaries based on common container data types: 
list, stack, queue, ordered set, ordered multiset, un- 
ordered set and unordered multiset. Cdt is unique 
among container packages in possessing the follow- 
ing characteristics: 


• All dictionaries are manipulated via a uniform
  set of operations regardless of storage methods
  (i.e., container data types);

• Storage methods can be dynamically changed,
  for example, to turn an unordered dictionary to
  an ordered one;

• Object attributes are described in discipline
  structures that support both set-like dictionaries,
  i.e., dictionaries that identify objects by
  matching, and map-like dictionaries, i.e., dictionaries
  that identify objects by keys;

• Discipline structures can be dynamically
  changed, for example, to change object comparators;

• Objects can be in multiple dictionaries, including
  dictionaries in shared memory; and

• Iterations are done directly over objects, i.e., no
  separate iterator types [6] needed.


Cdt uses splay trees for ordered sets and multi- 
sets, hash tables with move-to-front collision chains 
for unordered sets and multisets, and doubly linked 
lists for stacks, queues, and lists. The overall in- 
terface follows a method and discipline architec- 
ture [4, 11, 12]. Briefly, a library in this architec- 
ture provides handles and operations to hold and 
manipulate resources, methods to parameterize the
semantics and performance characteristics of operations,
and a discipline structure type that applications
can use to define resource attributes and acquisition.
Applying this architecture to Cdt, a handle is
a dictionary and operations include handle creation, 
object insert, search, delete, etc. A Cdt method 
maps to a container data type while a discipline lets 
applications define information and operations that 
pertain directly to objects such as key type and ob- 
ject allocation. 


2 The Cdt library 


Objects are managed via three data types: dictio- 
nary, discipline, and method. 


• Dictionary: A dictionary stores objects.


• Method: A dictionary has a method to define
how objects are stored within it. Available
methods are: Dtset for unordered sets, Dtbag
for unordered multisets, Dtoset for ordered sets,
Dtobag for ordered multisets, Dtlist for doubly
linked lists, Dtstack for stacks and Dtqueue for
queues.


• Discipline: Each dictionary has an application-
defined discipline structure that specifies object
comparison, hashing, allocation, and event
announcement.
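

To make the separation between dictionary operations and storage methods concrete, the following sketch shows a deliberately tiny handle whose insert operation is routed through a method table. It is not Cdt code: Mini, MiniMethod, MiniStack, MiniQueue, mini_insert() and mini_delete() are hypothetical names chosen for illustration, and only stack and queue behaviour is modelled.

```c
#include <stdlib.h>

/* Hypothetical miniature of a method-parameterized handle.
   None of these names belong to Cdt. */
typedef struct Node { struct Node *next; void *obj; } Node;
typedef struct Mini Mini;
typedef struct { void (*link)(Mini *, Node *); } MiniMethod;
struct Mini { Node *head, *tail; const MiniMethod *meth; };

/* Stack method: a new node goes to the front. */
static void link_front(Mini *m, Node *n)
{
    n->next = m->head;
    m->head = n;
    if (m->tail == NULL)
        m->tail = n;
}

/* Queue method: a new node goes to the back. */
static void link_back(Mini *m, Node *n)
{
    n->next = NULL;
    if (m->tail != NULL)
        m->tail->next = n;
    else
        m->head = n;
    m->tail = n;
}

static const MiniMethod MiniStack = { link_front };
static const MiniMethod MiniQueue = { link_back };

/* One insert operation for every method; 0 on success, -1 on failure. */
int mini_insert(Mini *m, void *obj)
{
    Node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->obj = obj;
    m->meth->link(m, n);
    return 0;
}

/* One delete operation: removes and returns the head object,
   i.e., the stack top or the queue head, or NULL if empty. */
void *mini_delete(Mini *m)
{
    Node *n = m->head;
    void *obj;
    if (n == NULL)
        return NULL;
    m->head = n->next;
    if (m->head == NULL)
        m->tail = NULL;
    obj = n->obj;
    free(n);
    return obj;
}
```

With this arrangement the calling code is identical for both methods; only the method pointer stored in the handle differs.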


2.1 Dictionary operations 


This section briefly overviews the main functions in 
the Cdt library.


Dt_t* dtopen(Dtdisc_t* disc, Dtmethod_t* meth)
creates a dictionary of type Dt_t with the given 
discipline disc and method meth. A dictionary 
dt is closed or cleared with dtclose(Dt_t* dt) and 
dtclear(Dt_t* dt). 


Functions Void_t* dtsearch(Dt_t* dt, Void_t* obj) 
and Void_t* dtmatch(Dt_t* dt, char* key) search 
the dictionary dt for an object matching respectively 
obj or key. Such a matched object becomes a current
object with special semantics in certain operations
discussed below. Void_t is defined as void for
ANSI-C or C++ and char for older C variants so 
it is suitable for exchanging addresses between the 
library and applications. 


Void_t* dtinsert(Dt_t* dt, Void_t* obj) inserts an 
object obj into the dictionary dt. Methods Dtset 
and Dtoset allow obj to be inserted only if there is 
no matching object already in dt. Other methods 
always insert a new object because they allow in- 
sertion of equal objects. Method Dtstack inserts ob- 
jects at stack top. Method Dtqueue inserts objects at 
queue tail. Method Dtlist inserts an object before 
the current object of dt if there is one, or at list head 
otherwise. An inserted or found object becomes the 
new current object. 


Void_t* dtdelete(Dt_t* dt, Void_t* obj) is used to
delete from dt an object matching obj if one exists. 
dtdelete(dt,NULL) works with Dtstack and Dtqueue 
and removes respectively the top or head object. 


Object iteration depends on a particular object or- 
dering defined by the storage method in use. For 
Dtoset and Dtobag, objects are ordered by object 
comparisons. For Dtstack, objects are ordered in 
the reverse order of insertion. For Dtqueue, objects 
are ordered in the order of insertion. For Dtlist, 
objects are ordered by their list positions. For Dtset 
and Dtbag, the object order is defined at the point 
of use and may change on any search or insert oper- 
ation. 


There are many ways to iterate over objects in a 
dictionary. The below loop iterates forward over all 
objects in a dictionary dt: 


for(o = dtfirst(dt); o; o = dtnext(dt,o) )


Alternatively, the below loop can be used to iterate 
backward over objects: 


for(o = dtlast(dt); o; o = dtprev(dt,o) )


2.2 Storage methods


A storage method is of type Dtmethod_t and defines 
how objects are manipulated. Cdt provides the fol- 
lowing storage methods: 


• Dtset and Dtbag: These methods are based on
hash tables with move-to-front collision chains.
Dtset stores unique objects while Dtbag allows
repeatable objects (i.e., objects that compare
equal). Repeatable objects are collected together
so that any iteration always passes over
sections of them. Object accesses take expected
O(1) time given a good hash function.


• Dtoset and Dtobag: These methods store ordered
objects in top-down splay trees. Dtoset
stores unique objects while Dtobag allows
repeatable objects. Object accesses take amortized
O(log n) time. Splay trees adapt well to
biased access patterns because frequently accessed
objects migrate closer to tree roots.


• Dtlist: This method stores repeatable objects
in a doubly-linked list. An object is always inserted
in front of the current object which is
either the list head or established by a search,
insert, or iteration. Object insertion and deletion
are done in O(1) time.


• Dtstack and Dtqueue: These methods store
repeatable objects in stack and queue order. In
a stack order, objects are kept in reverse order
of their insertion. In a queue order, objects are
kept in order of their insertion. Object insertion
and deletion are done in O(1) time.
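

The move-to-front heuristic mentioned for Dtset and Dtbag can be sketched in a few lines. The code below is an illustration over a single collision chain, assuming string keys; Chain and chain_search() are hypothetical names, and the actual Cdt implementation is not shown in this paper.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical element of one collision chain. */
typedef struct Chain { struct Chain *next; const char *key; } Chain;

/* Search the chain at *head for key. On a hit, unlink the node and
   relink it at the front so that a repeated search finds it first.
   Returns the matching node, or NULL if there is none. */
Chain *chain_search(Chain **head, const char *key)
{
    Chain *prev = NULL, *n;

    for (n = *head; n != NULL; prev = n, n = n->next) {
        if (strcmp(n->key, key) == 0) {
            if (prev != NULL) {         /* not already at the front */
                prev->next = n->next;
                n->next = *head;
                *head = n;
            }
            return n;
        }
    }
    return NULL;
}
```

Frequently searched keys therefore drift toward the chain heads, which shortens later searches for them.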


2.3 Disciplines 


A discipline structure is of type Dtdisc_t. Applications
use disciplines to define object attributes such
as comparison, hashing, and allocation. 


Figure 1 shows Dtdisc_t. Dtdisc_t.key and
Dtdisc_t.size identify a key of type Void_t* used
for object comparison or hashing. Dtdisc_t.key defines
the offset in an object where the key resides.
Dtdisc_t.size defines the key type. A positive value
means that the key is a byte array of given length, 
a zero value means that the key is a null-terminated 
string, and a negative value means that the key is 
a null-terminated string whose address is stored at 
the key offset. 
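

The three cases for the size field can be summarized by a small helper that locates and compares keys. This is a sketch under the conventions just described; key_of() and key_compare() are hypothetical helpers, not Cdt functions, and the library's internal default comparison may differ.

```c
#include <stddef.h>
#include <string.h>

/* Locate the key of obj given a key offset and a size code:
   size > 0  : byte array of that length stored at the offset
   size == 0 : null-terminated string stored at the offset
   size < 0  : pointer to a null-terminated string stored at the offset */
const void *key_of(const void *obj, int key, int size)
{
    const char *k = (const char *)obj + key;
    return size < 0 ? *(char * const *)k : k;
}

/* Compare two keys located by key_of() under the same size code. */
int key_compare(const void *k1, const void *k2, int size)
{
    return size > 0 ? memcmp(k1, k2, (size_t)size) : strcmp(k1, k2);
}
```

In the token example of Section 2.4 the key is a char* member, which corresponds to the negative case.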


typedef struct
{   int        key;     /* key offset    */
    int        size;    /* key size/type */
    int        link;    /* object holder */
    Dtmake_f   makef;   /* object makef  */
    Dtfree_f   freef;   /* object freef  */
    Dtcompar_f comparf; /* comparator    */
    Dthash_f   hashf;   /* hash function */
    Dtmemory_f memoryf; /* allocator     */
    Dtevent_f  eventf;  /* event handler */
} Dtdisc_t;


Figure 1: A discipline structure 


Objects are held in a dictionary via holders of type 
Dtlink_t. If Dtdisc_t.link is negative, the library
will allocate object holders. Otherwise, the library 
assumes that object holders are embedded inside ob- 
jects and Dtdisc_t.link defines the offset in an ob- 
ject where the holder resides. 
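

For the embedded case, converting between an object and its holder is simple pointer arithmetic on the link offset. The sketch below assumes a non-negative link; Holder is a hypothetical stand-in for Dtlink_t, whose actual layout is not given here, and holder_of() and object_of() are illustrative names.

```c
#include <stddef.h>

/* Hypothetical holder type; the real Dtlink_t layout is not shown. */
typedef struct Holder { struct Holder *left, *right; } Holder;

/* Address of the holder embedded in obj at byte offset link (link >= 0). */
Holder *holder_of(void *obj, int link)
{
    return (Holder *)((char *)obj + link);
}

/* Address of the object that embeds holder h at byte offset link. */
void *object_of(Holder *h, int link)
{
    return (char *)h - link;
}
```

Because only an offset is recorded, the same object type can embed several holders at different offsets and so take part in several containers at once.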


Dtdisc_t.makef and Dtdisc_t.freef, if defined, are 
called to make and free objects when they are in- 
serted or deleted. If Dtdisc_t.makef is not defined, 
then in the call dtinsert(dt,obj) obj itself will be 
inserted. 


If Dtdisc_t.comparf or Dtdisc_t.hashf are not de- 
fined, some internal functions are used. By allowing 
both key definition and compare function in a dis- 
cipline, both set-like and map-like dictionaries are 
supported. 


Dtdisc_t.memoryf, if defined, is used to allocate
space. Dtdisc_t.eventf, if defined, announces var- 
ious events such as dictionary opening and closing 
and method or discipline changes. 


2.4 An example Cdt application 


A common container data type example is given in 
the Map associative array paper [3] and the Unix/C 
manual page for the function tsearch(). This ap- 
plication reads a text file, partitions it into tokens 
(strings separated by space, tab and new line char- 
acters), keeps frequency count for each token, and 
finally writes out the tokens and their frequencies. 


Figure 2 shows an implementation of the token 
counting example. Omitted are a few minor gram- 
matical statements and the function readtoken() to 
parse an input stream into tokens. The below comments
are based on line numbers in the figure:


 1. #include <sfio.h>
 2. #include <cdt.h>
 3. typedef struct
 4. {  Dtlink_t link;
 5.    char*    token;
 6.    int      freq;
 7. } Token_t;

 8. Dtdisc_t Tkdisc =
 9.    { offsetof(Token_t,token), -1, 0 };

10. Token_t* newtoken(char* s)
11. {  Token_t* tk;
12.    tk = malloc(sizeof(Token_t));
13.    tk->token = malloc(strlen(s)+1);
14.    strcpy(tk->token,s);
15.    tk->freq = 1;
16.    return tk;
17. }

18. main()
19. {  char* s;
20.    Token_t* tk;
21.    Dt_t* dt = dtopen(&Tkdisc,Dtset);

22.    while((s = readtoken(sfstdin)) )
23.    {  if((tk = dtmatch(dt,s)) )
24.          tk->freq += 1;
25.       else dtinsert(dt,newtoken(s));
26.    }

27.    for(tk = dtfirst(dt); tk;
28.        tk = dtnext(dt,tk) )
29.       sfprintf(sfstdout,"%s:\t%d\n",
30.                tk->token, tk->freq);
31.
32. }

Figure 2: Program to count tokens 


1-2: The header file sfio.h [4] declares I/O functions.
cdt.h is the Cdt public header file and
declares necessary types, values and functions.


3-7: Token_t is a structure to hold a string token and
a frequency count freq. It also embeds the container
holder structure in the link field.


8-9: The discipline Tkdisc describes attributes of
Token_t objects. The ANSI-C macro offsetof()
defines the offset of Token_t.token in Token_t.
Since Token_t.token points to a null-terminated
string, Tkdisc.size is set to -1. Tkdisc.link is
set to 0, the offset to Token_t.link in Token_t.


10-17: newtoken() is a function to create a new Token_t
structure from a given string s. To simplify the
exposition, error checks for the malloc calls were
omitted.


21: A new dictionary dt is created based on the discipline
Tkdisc and the method Dtset. Here it is
assumed that tokens need not be sorted. Otherwise,
Dtoset could be used (see also Section 4).


22-26: Tokens are read and inserted into dt. Line 23
uses dtmatch() to find out if a token matching
the current read token already exists in dt. In
that case, only its frequency count is updated.
Otherwise, Line 25 creates and inserts a new
token structure into dt.


27-30: These lines loop over all tokens and output both
tokens and their frequency counts. Note that
this is done directly over objects without the
aid of any iterator type [6].
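

Figure 2 omits readtoken(). One possible version is sketched below, assuming standard stdio streams rather than sfio, a fixed-size static buffer, and the separators named above (space, tab and new line). MAXTOKEN and the truncation policy are assumptions made for the sketch and are not taken from the benchmark program.

```c
#include <stdio.h>

#define MAXTOKEN 256    /* assumed upper bound on stored token length */

/* Return the next token from fp, or NULL at end of input.
   The result points to a static buffer that is overwritten by the
   next call. Characters beyond MAXTOKEN-1 are read and discarded. */
char *readtoken(FILE *fp)
{
    static char buf[MAXTOKEN];
    size_t n = 0;
    int c;

    /* skip leading separators */
    while ((c = getc(fp)) == ' ' || c == '\t' || c == '\n')
        ;
    if (c == EOF)
        return NULL;

    /* collect characters up to the next separator or end of input */
    do {
        if (n < MAXTOKEN - 1)
            buf[n++] = (char)c;
    } while ((c = getc(fp)) != EOF && c != ' ' && c != '\t' && c != '\n');

    buf[n] = '\0';
    return buf;
}
```

Since the buffer is reused, a caller that keeps tokens must copy them, as newtoken() does in Figure 2.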


3 Performance


Among the various container data types, hash tables 
and binary trees are most common and also have 
large variation in implementation quality. This sec- 
tion presents results from a performance study that 
compared various set and map container structure 
packages based on hash tables and binary trees. 


3.1 Methodology 


The token counting application in Section 2.4 was 
used as a benchmark. To minimize implementation 
variation, a single program based on Figure 2 was 
written. Compile time options allowed switching 
usages of the Cdt methods Dtoset and Dtset, the 
Unix/C package tsearch, the C++ classes Set and 
Map, and the STL templates map and hashmap. All 
implementations used the same string comparison 
function, a variant of strcmp() that also keeps an invocation
count. In addition, all hash table implementations
used the same hash function supplied by Cdt
so that hash value computation would be uniform. 
This was necessary because the default hash func- 
tions in some of the packages were not very good. 
For example, the comparison counts in Section 3.2 
for the Set package would have been much higher if 
its default hash function had been used.


A variety of input files were used:


• ps: PostScript source of a technical paper,


• src: an archive of C source code,


• kjv: a King James version of the Bible,


• mbox: a personal mail archive,


• host: a database mapping IP addresses to machine
hosts, and


• city: a database mapping cities to area codes.


Dataset | Size   | Tokens  | Distinct | Avg. length
ps      | 1,989K | 335,997 | 11,912   | 38.00





Table 1: Summary of benchmark input files 


Table 1 summarizes input file statistics: file size in 
K-bytes, total number of tokens, number of distinct 
tokens, and average length of a token. These input 
files represent a wide variety of data ranging from ps 
which has relatively few distinct tokens to city which 
has about 85% distinct tokens. Tokens in host and 
city are also highly ordered. 









Table 2: Sizes of benchmark programs











The experiment was performed on a SPARC-20 running
SUN OS5.4. Table 2 shows the sizes of the
benchmark programs. Both Dtset and Dtoset were 
combined in the same benchmark program with in- 
vocation options for method selection. Except for 
Map, other C++ packages caused the test code to 
be about twice as large as the C versions, a sign of 
code bloating due to the use of templates. Com- 
paring the results of compiling main(){} with C and 
C++ showed that only about 5K can be attributed 
to language difference. 


Program execution was done at night on a quiescent 
machine. Each time measurement was obtained by
running the same test 9 times, computing total cpu
and system times for each run, discarding the top 
two and bottom two scores to reduce variance, then 
averaging the remaining five scores. Space measure- 
ments were done by calling sbrk(0) before any dic- 
tionary was opened and after all output was done 
and computing differences in the return values. 
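

The averaging rule above amounts to a trimmed mean. A minimal sketch follows, with trimmed_mean() as a hypothetical helper name; how the individual run times are obtained is left out.

```c
#include <stdlib.h>

/* qsort() comparator for doubles in increasing order. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sort t[0..n-1] in place, drop the trim smallest and the trim largest
   scores, and return the mean of the rest. Requires n > 2*trim. */
double trimmed_mean(double *t, size_t n, size_t trim)
{
    double sum = 0.0;
    size_t i;

    qsort(t, n, sizeof t[0], cmp_double);
    for (i = trim; i < n - trim; i++)
        sum += t[i];
    return sum / (double)(n - 2 * trim);
}
```

For the procedure described here, n is 9 and trim is 2, so five scores are averaged.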


3.2 Hash table packages 


Below are brief descriptions of the container pack- 
ages that use hash tables to implement unordered 
sets and maps. The Unix/C hsearch package was 
omitted because it was too slow to measure. 


• Set: The C++ Set class that comes standard
with our compiler. This uses a hash table with
chaining to resolve collisions.


• hashmap: The C++ STL hashmap template. This
uses a hash table with chaining to resolve collisions.


• Dtset: Method Dtset of Cdt. This uses a hash
table with chaining. The collision chains use a
move-to-front heuristic to improve search time.


Table 3: Hash: comparison counts in thousands 





Table 3 shows comparison counts for the hash table 
packages in units of thousands. hashmap performed 
worst, with comparison counts many times higher 
than those of Set and Dtset. Dtset relies on the fact that
tokens that compare equal must have the same hash
value, so tokens whose hash values differ need not be
compared at all. This eliminated many
comparisons because the hash function distinguished
objects well. The low comparison counts for Set suggested
that it might have used the same strategy as
Dtset. Dtset retains a slight edge, perhaps due to its
move-to-front strategy on collision chains.
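

The effect of comparing hash values before keys can be illustrated as follows. This sketch assumes that each chain element caches its hash value; HNode, str_hash(), hash_chain_find() and the counter ncompare are hypothetical, and str_hash() is a simple multiplicative hash chosen for brevity, not the hash function supplied by Cdt.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical chain element that caches the hash value of its key. */
typedef struct HNode {
    struct HNode *next;
    unsigned long hash;
    const char   *key;
} HNode;

static unsigned long ncompare;  /* counts full key comparisons */

/* Assumed string hash for illustration only. */
unsigned long str_hash(const char *s)
{
    unsigned long h = 5381;
    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Find key in a chain, calling strcmp() only when hash values agree. */
HNode *hash_chain_find(HNode *chain, const char *key)
{
    unsigned long h = str_hash(key);

    for (; chain != NULL; chain = chain->next) {
        if (chain->hash != h)
            continue;       /* differing hash values imply differing keys */
        ncompare++;
        if (strcmp(chain->key, key) == 0)
            return chain;
    }
    return NULL;
}
```

When the hash function separates keys well, nearly every strcmp() call in such a search is one that succeeds.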


Table 4 shows time performance. Poor comparison 
counts directly translated to poor computing time. 
hashmap was worst, sometimes up to a factor of 3
slower than Dtset. Dtset was fastest among the three
packages.


Table 4: Hash: times in seconds


Table 5: Hash: space in K-bytes


Table 5 shows space usage. Dtset and hashmap used 
about the same amount of space. Set sometimes 
used twice as much space as the other packages. 


3.3 Binary tree packages


A further requirement could be stipulated in the token
counting example that tokens must be output in
a lexicographic order. In that case, a natural solu- 
tion is to use container packages that maintain or- 
dered tokens. Below are the studied container pack- 
ages for ordered sets and maps: 


• tsearch: The tsearch function in SUN OS5.4.
This uses plain binary trees.


• Map: The C++ Map class. This is based on AVL
balanced trees.


• map: The C++ STL map template. This uses
red-black balanced trees.


• Dtoset: Method Dtoset of Cdt. This uses top-
down splay trees.


Table 6 shows comparison counts for the binary 
tree methods. Except for city and host, tsearch 
performed well despite its simplistic data structure. 
This is because most datasets consist of more or less 
random tokens and binary trees built from such random
data are naturally balanced. tsearch did poorly
on city and host whose tokens were highly ordered.
The balanced tree packages Map and map ignored any
such ordering property in the data. Both packages
used about the same number of comparisons with
map having a slight edge. The splay tree approach
in Dtoset took advantage of data ordering to reduce
comparisons. As a result, Dtoset was the clear winner
in all cases.


Table 6: Tree: comparison counts in thousands


Table 7 shows time performance. As with the hash
table methods, comparison counts mapped directly
to time. Dtoset was fastest, sometimes by a factor
of 3 or more over some of the other methods. Note
that Dtoset was even faster than the STL hashmap
package which did not have to order tokens.


Table 7: Tree: times in seconds
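

The behaviour of splay trees on ordered input can be seen in a small experiment. The code below is a sketch of the standard top-down splay algorithm over integer keys with a comparison counter, in the spirit of the counting strcmp() variant used in the benchmark. It is not the Dtoset implementation; Tree, splay(), splay_insert() and ncmp are illustrative names.

```c
#include <stddef.h>

typedef struct Tree { struct Tree *left, *right; int key; } Tree;

static long ncmp;                       /* number of key comparisons */
static int cmp(int a, int b) { ncmp++; return (a > b) - (a < b); }

/* Top-down splay: returns the new root, which holds key if it is
   present and otherwise a key adjacent to it in sorted order. */
Tree *splay(Tree *t, int key)
{
    Tree N, *l, *r, *y;
    int c;

    if (t == NULL)
        return NULL;
    N.left = N.right = NULL;
    l = r = &N;
    while ((c = cmp(key, t->key)) != 0) {
        if (c < 0) {
            if (t->left == NULL)
                break;
            if (cmp(key, t->left->key) < 0) {    /* rotate right */
                y = t->left; t->left = y->right; y->right = t; t = y;
                if (t->left == NULL)
                    break;
            }
            r->left = t; r = t; t = t->left;     /* link into right tree */
        } else {
            if (t->right == NULL)
                break;
            if (cmp(key, t->right->key) > 0) {   /* rotate left */
                y = t->right; t->right = y->left; y->left = t; t = y;
                if (t->right == NULL)
                    break;
            }
            l->right = t; l = t; t = t->right;   /* link into left tree */
        }
    }
    l->right = t->left;                 /* reassemble the three parts */
    r->left = t->right;
    t->left = N.right;
    t->right = N.left;
    return t;
}

/* Insert node n (with n->key set) unless an equal key is present.
   Returns the new root. */
Tree *splay_insert(Tree *t, Tree *n)
{
    int c;

    n->left = n->right = NULL;
    if (t == NULL)
        return n;
    t = splay(t, n->key);
    c = cmp(n->key, t->key);
    if (c == 0)
        return t;
    if (c < 0) { n->left = t->left;   n->right = t; t->left = NULL; }
    else       { n->right = t->right; n->left = t;  t->right = NULL; }
    return n;
}
```

When keys arrive in increasing order, each insertion finds the current maximum at the root with an empty right subtree, so it costs a constant number of comparisons instead of a walk down the tree.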


Table 8 shows memory usage. tsearch and Map
used more memory than other methods. Map’s extra
space was due to the balancing data. The cause of
tsearch’s poor memory usage was unclear, although
a memory trace using Vmalloc [12] revealed a mysterious
extra allocation after each holder allocation.


Table 8: Tree: space in K-bytes


4 Flexible programming with Cdt


To output tokens in order, a strategy that often 
works better than just using Dtoset is as follows. 
First, Dtset is used to construct the token dictio- 
nary. Then, Dtoset is used right before outputting 
to sort tokens into the right order. To implement 
this strategy, the below line of code can be inserted 
before Line 27 of Figure 2:


dtmethod(dt, Dtoset); 


Dataset | Dtset | Dtoset | Dtset+Dtoset
ps      | 324   | 1,636  | 498
src     | 129   | 173    | 608
kjv     | 789   | 8,592  | 1,307
mbox    | 370   | 5,425  | 1,282
host    | 347   | 2,568  | 2,014


Table 9: Tuning: comparison counts in thousands 





Table 9 shows comparison counts for the above strategy.
Except for city and host, Dtset+Dtoset improves
substantially over the exclusive use of Dtoset,
up to 70% for kjv.


Table 10: Tuning: times in seconds





Table 10 shows time performance. Time usages for
Dtset+Dtoset markedly improved over the lone use
of Dtoset on most datasets except for city and host.
For these datasets, though comparison counts went 
down somewhat, time measurements actually went 
up. This was because these datasets contained 
many distinct tokens and Dtoset ended up repeat- 
ing Dtset’s work. 
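

The same two-phase idea, unordered construction followed by a single ordering step, can be expressed in standard C without Cdt. The sketch below counts tokens in a chained hash table and then sorts pointers to the distinct tokens once with qsort(). It is an analogue of the strategy rather than a use of dtmethod(); TokTable, tok_count(), tok_sorted() and the table size NBUCKET are assumptions.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKET 1024            /* assumed table size for the sketch */

typedef struct Tok { struct Tok *next; char *token; int freq; } Tok;
typedef struct { Tok *bucket[NBUCKET]; size_t count; } TokTable;

static unsigned long tok_hash(const char *s)
{
    unsigned long h = 5381;
    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Phase 1: unordered counting. Returns 0 on success, -1 if out of memory. */
int tok_count(TokTable *tab, const char *s)
{
    Tok **head = &tab->bucket[tok_hash(s) % NBUCKET], *t;

    for (t = *head; t != NULL; t = t->next)
        if (strcmp(t->token, s) == 0) {
            t->freq += 1;
            return 0;
        }
    if ((t = malloc(sizeof *t)) == NULL)
        return -1;
    if ((t->token = malloc(strlen(s) + 1)) == NULL) {
        free(t);
        return -1;
    }
    strcpy(t->token, s);
    t->freq = 1;
    t->next = *head;
    *head = t;
    tab->count += 1;
    return 0;
}

static int by_token(const void *a, const void *b)
{
    const Tok *x = *(Tok * const *)a, *y = *(Tok * const *)b;
    return strcmp(x->token, y->token);
}

/* Phase 2: one sort over the distinct tokens. Returns a malloc'ed array
   of tab->count pointers in lexicographic order, or NULL if the table
   is empty or memory runs out. */
Tok **tok_sorted(const TokTable *tab)
{
    Tok **v, *t;
    size_t i, n = 0;

    if (tab->count == 0 || (v = malloc(tab->count * sizeof *v)) == NULL)
        return NULL;
    for (i = 0; i < NBUCKET; i++)
        for (t = tab->bucket[i]; t != NULL; t = t->next)
            v[n++] = t;
    qsort(v, n, sizeof *v, by_token);
    return v;
}
```

The ordering cost is paid once per distinct token rather than once per token read, which is why the approach helps most on inputs with few distinct tokens.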


The above situation is common in practice. Pro- 
grams must often deal with data that have special 
characteristics. It is seldom the case that efficient 
algorithms can be devised to adapt smoothly to the 
data diversity and operate optimally in each spe- 
cial situation. Therefore, whenever possible, a good 
design principle is to let users select and combine
computing methods to optimize processing based on
specific knowledge of the data. Cdt simplifies doing
this in the context of using container data types. In
fact, the benchmark program was written to allow
strategy selection at invocation time. It is not easy
to do the same using the other container packages.


 1. int freqcmp(Dt_t* dt, Void_t* arg1,
 2.             Void_t* arg2, Dtdisc_t* disc)
 3. {  int d;
 4.    Token_t* t1 = (Token_t*)arg1;
 5.    Token_t* t2 = (Token_t*)arg2;
 6.    if((d = t1->freq - t2->freq) != 0)
 7.       return d;
 8.    else return strcmp(t1->token,t2->token);
 9. }

10. Tkdisc.comparf = freqcmp;
11. Tkdisc.key = Tkdisc.size = 0;
12. dtdisc(dt,&Tkdisc,DT_SAMEHASH|DT_SAMECMP);
13. dtmethod(dt,Dtoset);


Figure 3: Order tokens by frequency


As another example of Cdt’s flexibility, suppose that
the output requirement is changed to ordering to- 
kens in increasing order of frequency. To do this, 
Figure 2 should be augmented with Lines 1-9 of Figure 3
before main() and Lines 10-13 of the same figure
before Line 27. The below comments pertain to
line numbers in Figure 3: 


1-9: The function freqcmp() compares tokens first 
by frequency, then by token names. So within a 
group of tokens with the same frequency, tokens 
will be ordered lexicographically. 


10: The comparator is redefined to be freqcmp(). 


11: Tkdisc.key and Tkdisc.size are set to 0 to 
indicate that Token_t objects will be compared
whole instead of via the key strings
Token_t.token.


12: dtdisc() is called to officially change the dis- 
cipline. Normally, a discipline change implies 
rearranging of objects because hash values may 
have changed or objects that used to compare 
distinct may have become equal. The flags
DT_SAMEHASH and DT_SAMECMP tell dtdisc() that, 
in this case, both hash values and object com- 
parison remain unchanged. The latter is strictly 
untrue but it saves computation that would be 
done anyway on line 13.


13: dtmethod(dt,Dtoset) is called to switch the storage
method to Dtoset and sort tokens by the
new comparator. 


Note that in this example it is possible to use method 
Dtoset with the comparator freqcmp() from the start 
of the application. However, doing so would have 
been prohibitively expensive because objects must 
be deleted and reinserted each time their frequen- 
cies are updated. Thus, for efficiency, it is neces- 
sary that Dtset is used during dictionary construc- 
tion and Dtoset is used only at the end before out- 
putting. 
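

For comparison, when the counts are already held in an array, the same final ordering can be obtained in standard C with qsort(). The sketch below uses the hypothetical names Count and sort_by_freq(); its comparator orders frequencies by comparison rather than subtraction so that it cannot overflow for extreme values.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for one distinct token and its count. */
typedef struct { const char *token; int freq; } Count;

/* Order by increasing frequency, then lexicographically by token. */
static int by_freq(const void *a, const void *b)
{
    const Count *x = a, *y = b;

    if (x->freq != y->freq)
        return (x->freq > y->freq) - (x->freq < y->freq);
    return strcmp(x->token, y->token);
}

/* Sort v[0..n-1] once, after all counting is finished. */
void sort_by_freq(Count *v, size_t n)
{
    qsort(v, n, sizeof *v, by_freq);
}
```

As in the Cdt version, the frequency ordering is imposed only after the counts have stopped changing.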


5 Discussion 


This paper introduced Cdt, a container data type 
library. The library provides the common storage 
methods: set, multiset, ordered set, ordered multi- 
set, list, stack and queue which are often seen only 
in isolated packages. Cdt achieves an interface that 
keeps orthogonal the three design dimensions: dic- 
tionary operations, storage methods, and object de- 
scriptions. This is a goal attempted but not quite 
achieved by other recent work on reusable compo- 
nents such as the C++ Standard Template Library. 


Many contemporary container libraries are unwieldy 
because their interfaces are not sufficiently abstract 
and operations are tied too closely to container data 
types. At worst, this leads to divergent interfaces for 
the same basic operations as shown by the Unix/C 
search functions. Even with better interface design 
as in the STL case, the close tie between implemen- 
tation techniques and abstract interfaces can reduce 
the generality of the library. For example, instead 
of a general container template that can be param- 
eterized by storage methods, STL provides various 
container templates such as hashmap and map that are 
strongly bound to minimal object requirements ac- 
cording to the respective implementation techniques. 
As a result, although hashmap and map provide simi- 
lar functions they require objects with different type 
specifications. This means that there is no simple 
way to dynamically convert a hashmap container to a 
map container in the style discussed in Section 4. Cdt 
avoids such interface limitations by making dictionary
operations completely abstract and parametrizable
by methods and disciplines, which are orthogonal
and mutable attributes of dictionaries. The
method and discipline architecture naturally lifts a 
library interface to its most general level. Perhaps
some future STL work can benefit from such an interface
analysis and design.


Cdt disciplines are run-time structures used to define
object attributes such as keys, comparison, hashing, 
and allocation. By allowing both comparators and 
keys, Cdt generalizes set-like and map-like container 
packages. This leads to a unifying interface to man- 
age such containers. Using run-time structures for 
type definition means losing certain services com- 
mon to C++ templates such as static type checking 
and inlining of comparison functions. The loss of 
static type checking is balanced out by the added 
programming flexibility. For example, Cdt allows
the same objects to be described in multiple ways,
and both disciplines (i.e., object types) and methods
(i.e., container types) can be arbitrarily mixed
and changed. The efficiency loss resulting from the lack of
inlining of object comparisons is compensated by the
advantage of having a single library code image and 
consequent code size reduction as exemplified in Ta- 
ble 2. A single code image also makes possible using 
Cdt as a dynamically loadable shared library. Fur- 
ther, for applications such as the discussed token 
counting example which require relatively complex 
objects, any function call overhead to compare two 
objects would be negligible relative to the cost of the 
comparison itself. 


A performance study showed that Cdt methods 
Dtoset and Dtset performed as well as or better than
their counterparts in other C and C++ container li- 
braries including the modern STL components. The 
Cdt methods consistently used about the same or 
less space than other packages while they were faster 
than other packages by up to a factor of two or more. 
The use of splay trees and hash tables with self-
adjusting collision chains enables these methods to
perform well in a wide range of input data. Ex- 
amples were given showing how further performance 
gains can be made with selective matching of disci- 
plines and methods at run time. 


Cdt is a descendant of Libdict [7]. It is a mature
library and has been used in many applications,
including large-scale information systems that
routinely handle dictionaries with tens to hundreds
of thousands of objects.


Code availability


Abstract


deet is a simple but powerful debugger for ANSI C and 
Java. It differs from conventional debuggers in that it 
is machine-independent, graphical, programmable, dis- 
tributed, extensible, and small. Low-level operations are 
performed by communicating with a “nub,” which is a 
small set of machine-dependent functions that are em- 
bedded in the target program at compile-time, or are im- 
plemented on top of existing debuggers. deet has a 
set of commands that communicate with the target’s nub. 
The target and deet communicate by passing messages 
through a pipe or socket, so they can be on different
machines. deet is implemented in tksh, an extension
of the Korn shell that provides the graphical facilities
of Tcl/Tk. Users can browse source files, set breakpoints,
watch variables, and examine data structures by
pointing and clicking. Additional facilities, like condi- 
tional breakpoints, can be written in either Tcl or the 
shell. Most debuggers are large and complicated; deet
is less than 1,500 lines of shell plus a few hundred lines 
of machine-specific nub code. It is thus easy to under- 
stand, modify, and extend. We describe an implementa- 
tion of the nub API for Java and an implementation that 
is layered on top of gdb. We have also implemented a 
version of gdb using the nub API, which demonstrates 
the modularity of the design. 


1 Introduction 


Traditional UNIX debuggers are indispensable tools for 
locating and fixing program errors. Despite their impor- 
tance and pervasiveness, they continue to harbor inad- 
equacies that limit their usability. For example, UNIX 
debuggers typically have textual user interfaces that are 
cryptic at best. When debuggers are hard to use, pro- 
grammers tend to litter their programs with print state- 
ments instead of using a debugger. 

While most PC debuggers run on only one operat- 
ing system and one architecture, UNIX debuggers must 
deal with portability issues. Debuggers are notoriously 
machine-dependent programs; they depend on the target
architecture, operating system, compiler, and linker.
Thus, porting a debugger from one variant of UNIX 
to another can require a substantial amount of effort. 
For example, about one-third of gdb’s source code is 
machine-dependent. 

Few debuggers have programming facilities in which, 
for example, programmers can write application-specific 
debugging code. Such code is useful for nontrivial 
queries of data structures, such as displaying the second 
to last element in a linked list, or all the positive elements
in an array. Other examples include setting conditional 
breakpoints and automating program testing. Debuggers 
that support programming facilities do exist, but often 
the language is idiosyncratic to either the debugger or 
the source language, or both, and hard to learn. 

Most debuggers are large and complex programs; for 
example, gdb [14] is about 150,000 lines of C. This 
complexity has some unfortunate consequences. First, 
debuggers are often themselves buggy, because, like any 
large program, their complexity and size makes them 
prone to errors and to inconsistent behaviors on differ- 
ent platforms. Second, debuggers are usually difficult 
to extend, because their implementations may be hard to 
understand and to modify. 

deet (desktop error elimination tool) addresses these
shortcomings. It provides both textual and graphical in- 
terfaces to make it easy to use. Users can perform most 
debugging actions by pointing and clicking, and data 
structures can be displayed graphically. The GUI is writ- 
ten with Tk [12]. deet is also programmable: Its ca- 
pabilities can be extended by writing in either Tcl or in 
tksh, a variant of the Korn shell [8]. 

Nearly all of deet’s implementation is machine-
independent. It uses a small “nub” that provides facilities
for communicating with the debugger and controlling the 
target. The nub-based approach permits deet to debug 
a target running on another machine. Figure 1 shows the 
screen of a typical debugging session. deet doesn’t at- 
tempt to match debuggers like gdb feature-for-feature; 
for example, deet can’t examine core dumps, evaluate
arbitrary C expressions, or debug at the assembly-language
level. Nevertheless, its implementation is surprisingly
simple. Its complete source is approximately
2,500 lines of shell and C.


Figure 1: deet screen dump


2 Using deet 


deet’s features are best explained by seeing it in action. 
First, the target program is compiled by lcc [3] with the
appropriate debugging option to embed the nub in the 
target: 


$ lcc -Wf-g4 wf.c lookup.c


Here and in the displays below, slanted type iden- 
tifies user input. When the generated a.out is exe- 
cuted, the debugger specified by the environment vari- 
able DEBUGGER is also started, so 


$ DEBUGGER=deet a.out


starts both a.out and deet. At this point, the source
window shown in Figure 2 appears. The user is prompted
for textual deet commands in the shell window from
which a.out was invoked, but most debugging actions
are performed with the mouse.

Single-clicking on a line highlights that line if it con- 
tains a breakpoint; double-clicking on the line sets the 
breakpoint. deet can set breakpoints on expressions, 
not just statements, so there may be more than one breakpoint
in a line. When a line has multiple breakpoints,
double-clicking sets the breakpoint closest to the cursor. 
Breakpoints are indicated in the window in lighter shad- 
ing or in yellow (see Figure 2). Double-clicking on a 
breakpoint that has already been set removes the break- 
point. 

The breakpoints window, like the one shown in Fig- 
ure 3, displays a list of all breakpoints and related infor- 
mation about each breakpoint, such as its location and 
break condition. These conditions are deet expressions 
that are evaluated whenever the breakpoint is reached; 
if the condition is true, the target stops. When the tar- 
get stops at a breakpoint, the current source window 
shows the file and line number of the breakpoint, and
reverse video highlights the line containing the breakpoint.

Figure 2: Source window


A condition can be changed by highlighting the break- 
point in the breakpoints window and editing the condi- 
tion field, and a breakpoint can also be removed by click- 
ing the “Delete” button in this window. 

The stack can be shown by clicking on the “Stack” but- 
ton in the source window (see Figure 2). This displays a 
new window that shows each frame on the stack, from 
the top down, as illustrated in the right middle portion 
of Figure 1. An individual frame can be selected, and 
clicking a button in the stack window performs the corre- 
sponding action on that frame. For instance, by clicking 
the “Dump” button, the names and values of the parame- 
ters and locals for that frame are displayed. Clicking the 
“OK” button causes the source window to display the file 
and line number of the call to the selected frame. 

Highlighting a variable in the source window and 
clicking “Print” causes a pop-up window to display the 
value of that variable (see the upper right corner of Fig- 
ure 2). If the variable is a pointer, a structure, or a union, 
double-clicking on the variable expands its value. For 
pointers, the value of the referent is displayed; for struc- 
tures and unions, the values of the fields are displayed. 
deet also displays the values of variables in balloon 
help pop-up windows when the cursor is left on top of 
the variables for sufficient time, similar to Microsoft’s 
Visual C++ debugger [11].

A variable can be modified by clicking “Modify” in 
the variable window, which prompts the user to enter 
a new value. A variable may also be watched, which 
causes its value to be displayed in the variable window 
and updated as execution passes each potential break- 
point. 


Figure 3: Breakpoints window 


Commands may be typed at the debugger in the shell 
window in a manner similar to gdb. For instance, the 
breakpoints command displays the current break- 
points: 


deet> breakpoints 

The following breakpoints are set: 
File test/wf.c, line 4 char 28 
File test/lookup.c, line 14 char 50 


Commands are just tksh commands, so shell com- 
mands like history, pwd, and make can be entered 
as well. 

Most of the state in a deet debugging session can be 
saved and restored later in a subsequent, separate debug- 
ging session. This state includes breakpoints and their 
conditions, locations of files, and user-defined tksh 
functions. deet saves the state by writing a shell script 
that can be interpreted to restore the state. 
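
The idea of saving state as a script that restores it can be sketched in a few lines of portable shell. This is only an illustration under stated assumptions: the BREAKPOINTS record format, the save_state function, and the "file:line.char" location syntax are hypothetical, and deet's actual saved scripts are not shown in this paper.

```shell
# Hypothetical sketch: persist breakpoints as a replayable shell script.
# BREAKPOINTS holds one "location|condition" record per line; this format
# and the function name are assumptions made for the example.
BREAKPOINTS='test/wf.c:4.28|
test/lookup.c:14.50|count > 10'

save_state() {
    # Emit one "b <location> <condition>" command per saved breakpoint.
    printf '%s\n' "$BREAKPOINTS" | while IFS='|' read -r loc cond; do
        if [ -n "$loc" ]; then
            printf "b '%s' '%s'\n" "$loc" "$cond"
        fi
    done
}
```

Redirecting the output of save_state to a file and later interpreting that file replays each b command. The sketch does not escape single quotes inside conditions, which a real implementation would have to handle.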


3 Design


deet divides cleanly into two parts: One part interacts 
with the programmer, and the other part interacts with 
the target program. The user-interface part is written in 
tksh, a version of the new Korn shell [2, 7] that has 
been extended to support Tcl [12]. The target program 
is controlled by a nub, which provides debugging primi- 
tives, as detailed below. deet’s implementation of and 
interaction with the nub is also written in tksh. Thus, 
programmers can modify and extend both parts of deet 
by writing tksh code.

Figure 4: cdb’s design


3.1 Cdb and deet 


deet is based on cdb [6]. cdb is a machine-independent
debugger that eliminates machine dependencies by
adding a small amount of information into the target pro- 
gram at compile time. cdb communicates with the target 
through a nub—a small machine-independent interface 
that constitutes the core functionality of the debugger, as 
suggested by Figure 4. The nub implementation can be 
made machine-independent, as cdb shows, but machine- 
dependent implementations are also possible and are un- 
doubtedly more efficient. The nub is small enough that 
re-implementing it for new platforms is nearly as easy as 
porting the machine-independent implementation. 
There are four components of cdb: 


1. The nub interface, which stands between the debugger
and machine-dependent target manipulations.


2. The nub implementation, which consists of the 
nub interface functions, special code emitted by 
the compiler to support the nub, and a wrapper 
around the linker to load the nub and the machine- 
independent symbol table. 


3. A machine-independent symbol table format, which
is emitted by the compiler and linked into the target. 


4. A simple, text-based debugger that uses the nub to 
provide minimal functionality; this debugger is in- 
tended to be replaced with more sophisticated de- 
buggers, like deet. 


Any of the last three components can be replaced with 
alternative implementations. For instance, the nub can be 
replaced with a machine-dependent implementation that 
uses the ptrace system call like most UNIX-specific 
debuggers, or by one that is layered on top of gdb. The 
machine-independent symbol tables could be replaced 
with the usual machine-dependent “stab” symbol tables 
embedded in UNIX executables. Finally, the debugger it- 
self could be replaced with any program that uses the nub 
interface. deet is a replacement for this fourth component.
Using deet does not directly involve changes to
any of the other components, but implementing deet
did induce additions to the nub and to the symbol-table 
format beyond their original designs. 

deet is written in tksh, which includes a C library 
that can be used to manipulate the state of the Tcl inter- 
preter, such as reading and writing variables and creat- 
ing new built-in commands. tksh can run any library 
written on top of the Tcl library, which includes the Tk 
graphics library. Thus, Tk commands, like button and 
pack, can be invoked from tksh scripts. 

tksh should be thought of as an extension to Tcl 
rather than as an alternative to it. tksh allows Tcl 
scripts to be run directly with the source command. 
Tcl scripts share variables and functions with tksh, al- 
lowing Tcl scripts to work with shell scripts. 

tksh is used as the debugging language for deet 
primarily because of its strengths as an interactive com- 
mand language. Debuggers are interactive programs. 
deet takes advantage of the interactive facilities of 
tksh, such as command-line editing, job control and 
pipelines. Using the command-line interface to deet 
feels like using a shell because the debugger itself is an 
extension of ksh. tksh also offers two familiar, high- 
level languages. Many programmers already know how 
to write shell and Tcl scripts, which is 90% of what’s 
needed to use deet. However, a perl or python pro- 
grammer could rewrite the deet front end and still use 
existing nub implementations. 

deet also includes additional built-in commands for 
debugging. deet’s code is simpler to understand than 
the corresponding C code would be, because it’s written 
in a high-level language. deet can also be modified
during a debugging session to suit specific applications. 


3.2 The Nub Interface


The nub interface is designed to be as small as possi- 
ble while supporting the fundamental debugging oper- 
ations common to all debuggers [6]. Figure 5 summa- 
rizes the complete API. The nub does not support high- 
level facilities, such as expression evaluation or specific 
symbol-table formats, because these facilities can be im- 
plemented by other interfaces or by debuggers them- 
selves. 

_Nub_set and _Nub_remove set and remove breakpoints,
which are specified by a file name, line number,
and character position. Unlike most debuggers, break- 
points specify the locations of expressions, not lines. So, 
for example, it is possible to set a breakpoint on the in- 
crement part of a C for loop. 

_Nub_src accepts incomplete breakpoints, in which
any of the file name, line number, or character position
are omitted, and invokes a debugger callback on all possible
breakpoints that “match” the incomplete one. deet
uses this function to determine which breakpoint to set
when a user clicks on a line.

_Nub_init      initialize the nub
_Nub_set       set a breakpoint
_Nub_remove    remove a breakpoint
_Nub_src       visit breakpoints with a given pattern
_Nub_frame     move to a specific stack frame
_Nub_fetch     read the target’s memory
_Nub_store     write the target’s memory

Figure 5: The nub interface
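
The matching of incomplete breakpoints can be illustrated with a small portable-shell sketch. It assumes a hypothetical CANDIDATES table of "file line char" triples standing in for the nub's internal list of stopping points, and a hypothetical match_bp function that prints matches instead of invoking a callback; neither is part of the nub.

```shell
# Sketch only: CANDIDATES and match_bp are hypothetical stand-ins for the
# nub's internal table of breakpoint locations and for its pattern visit.
CANDIDATES='wf.c 4 28
wf.c 9 12
lookup.c 14 50
lookup.c 14 61'

# match_bp FILE LINE CHAR: print every candidate consistent with the
# pattern; an empty FILE or a zero LINE/CHAR matches anything.
match_bp() {
    printf '%s\n' "$CANDIDATES" | while read -r f l c; do
        if [ -n "$1" ] && [ "$1" != "$f" ]; then continue; fi
        if [ "$2" -ne 0 ] && [ "$2" -ne "$l" ]; then continue; fi
        if [ "$3" -ne 0 ] && [ "$3" -ne "$c" ]; then continue; fi
        printf '%s %s %s\n' "$f" "$l" "$c"
    done
}
```

With this table, match_bp lookup.c 14 0 prints both stopping points on line 14 of lookup.c, which is the situation in which a user interface must pick the one closest to the cursor.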

_Nub_fetch and _Nub_store access the target’s
memory. They accept a buffer address, a byte count, and 
an address space identifier, and read/write data from/to 
the target. The address space identifier may specify an 
operating-system address space, such as the text or code 
segments. It can also specify logical address spaces that 
may not be part of the target, like the symbol table, for 
example. It is the nub’s responsibility to access the ap- 
propriate data. Debuggers can view all data about the 
target as if they were stored in memory. 

lcc emits machine-independent symbol tables in the
target’s address space, and _Nub_fetch reads these
data. Another cdb-specific, but machine-independent,
interface provides a higher-level view of the C symbol
table as an inverted tree of symbol objects. If we were
using the nub with, say, Modula-3, this high-level interface
would have to be replaced with one specific to
Modula-3, and that interface would use _Nub_fetch to
read the symbol table generated by the Modula-3 compiler.
There’s nothing special about lcc; other C compilers
could be used given an appropriate nub implementation.
It’s the nub that’s the critical component, not the
compiler.


3.3 The deet Nub Interface


deet includes versions of the nub and symbol-table 
functions for use with Tcl or tksh. These tksh com- 
mands differ from the C routines in two ways: They are 
at a higher level, because they manipulate source-level 
symbols, types, and values, and they accept and return 
strings, so that they can be used in Tcl or tksh scripts. 
The complete list appears in Figure 6. 
deet_breakpoint is a combination of _Nub_set,
_Nub_remove, and _Nub_src. The -set option sets
all of the given breakpoints, which might be incomplete;
that is, file can be "", and line and character can be
zero. The -delete option removes breakpoints, and
the -list option lists possible breakpoints.

deet_frame is equivalent to _Nub_frame: With no
arguments, it returns the current frame as a Tcl list containing
the frame number, the function name, and a file,
line number, character number triple that gives the location
of execution within that frame. With an integer argument
n, deet_frame makes frame n the current frame
and returns the null string. Frames are numbered from
the top of the stack, beginning with zero.

The deet_getval and deet_putval commands are
similar to _Nub_fetch and _Nub_store, but require
type information to be specified along with the value,
because Tcl deals only with strings. Tcl cannot, for example,
deal directly with binary floating-point values or
with structures. Types are specified by type identifiers,
which are just generated strings. deet_getval returns
a string representation for the value of type at address,
and deet_putval writes the value of type to locations
beginning at address.

deet_sym and deet_type return symbol table data.
A symbol-table entry is a Tcl list { name, type, address }.
deet_sym’s -all option returns a list of all of the symbols
in the target; that is, a list of three-element lists.
The -files option returns a list of all of the source
files in the target. The -locals and -params options
return lists of the locals and parameters for the current
frame. The -name name option returns the symbol-table entry
for name, or an error if name is not a visible symbol.

deet_type returns a string describing the type represented
by the identifier type. If type represents int,
deet_type returns "int". Similarly, if type represents
T *, T [n], or a structure type, deet_type returns,
respectively, the type identifier for T, n and the
type identifier for T, and a list of names and type identifiers
for the fields.

deet_open
        initialize the target
deet_breakpoint { -set | -delete | -list } file line character
        set, remove, and list breakpoints
deet_frame [ n ]
        get/set current frame
deet_getval type address
        read a value of type from address
deet_putval type address value
        write the value of type to address
deet_continue
        resume execution
deet_sym { -all | -files | -locals | -params | -name name }
        finds the symbol-table entries
deet_type type
        get symbol’s type information

Figure 6: deet’s nub interface

NeD [10] is another debugger built on a set of debugging
primitives. This set is larger than the set of
nub functions and the NeD primitives are at a somewhat
higher level. NeD’s primitives are written in Tcl
extended with a set of debugging functions. While
these functions present a nearly platform-independent
interface, their implementation appears to be platform-dependent
and perhaps nontrivial. Also, NeD has no user
interface per se; it uses Tcl in the same way as deet
uses the nub functions, while deet uses Tcl as its user-interface
language, as illustrated in the next section.


4 Programming in deet 


Much of deet itself is written in Tcl and tksh, using 
the deet_* nub commands described above. Users can 
extend deet by writing Tcl and tksh commands; for
example, features like conditional breakpoints and non- 
trivial program queries can be written in tksh. deet 
can also be extended by external programs. This section 
illustrates some typical extensions. 

Simple extensions can be written directly in tksh. 
For example, the following script displays all of the null 
elements in an array, the name of which is supplied as an 
argument. 


function nullElements {
    typeset arr=$1
    integer i s=$(arraySize $arr)

    for (( i=0; i < s; i++ )); do
        if [[ $(var "$arr[$i]") == 0x0 ]]
        then
            print "Element $arr[$i] null"
        fi
    done
}


nullElements uses two external tksh functions: 
arraySize, which returns the number of elements in 
an array, and var, which returns the value of a variable. 
These functions are provided as part of deet. The for 
loop visits each element of the array specified by the first 
argument, retrieves its value, and prints the array name 
and index of the null elements. 

User-defined functions can also manipulate deet’s 
interface. For example, if we’re checking repeatedly for 
null elements in hashtable, we can construct a button 
to do the job in one click: 


toplevel .null 

pack $(button .null.b \ 
-text "Print Null Elements" \ 
-command "nullElements hashtable") 


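This code builds a button labeled “Print Null Elements” in
its own top-level window; clicking it runs nullElements
on hashtable.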





Tcl scripts can be invoked with the source command,
which is a tksh built-in. source uses the Tcl
parser to parse its input, and uses tksh variables and
functions in variable and command substitutions. Here’s
a simple example:


function foo {
    X=37
    print "$(bar test)"
}

source <<'EOT'
proc bar {args} {
    global X
    set X [expr $X + 1]
    return "bar: args: $args, X: $X"
}
EOT

A call to foo prints

    bar: args: test, X: 38


Note that the Tcl procedure bar can use and modify the
shell variable X. Tcl source code can also invoke tksh
functions and built-ins.

deet’s name space is separate from the target’s name 
space. Accessing a target variable from a tksh script 
requires a special function, var, which uses the target’s 
symbol table to look up the variable name and retrieve its
value. var is sufficient for one-shot lookups, but it’s tedious
for repeated uses of specific target variables. For
these uses, deet provides linkvar name, which creates
a new shell variable that is essentially an alias for
the target variable name. linkvar is implemented with
discipline functions, which are similar to trapped vari- 
ables in SNOBOL4 [5]. A discipline function is a shell 
function that is associated with a variable, and that func- 
tion is invoked whenever the variable is read or written. 
Thus, associating the function 


function foo.get {
    foo="$(var foo)"
}


with foo arranges for the target variable to be fetched 
every time the shell variable foo is read. 

deet’s capabilities are easily extended by writing Tcl 
and tksh scripts that use the built-in debugger commands.
An important advantage of using a shell as the debugging
language is that the shell can use any external
tool. For example, it’s relatively easy to extend deet 
to display linked data structures graphically as directed 
graphs. This feature is similar to that provided by the 
Data Display Debugger (ddd) [16], but the implementa- 
tion is much simpler, because deet uses existing tools 
instead of building its own facilities. deet runs dotty, 
a program for drawing directed graphs [9], to draw the 
graph, sending it the appropriate input for the data struc- 
ture of interest. Figure 7 shows an example of dotty’s 
output. The tksh script that invokes dotty is only 
about 60 lines of code, and it handles any linked data 
structure. 
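
To give a feel for how little glue such an extension needs, the following portable-shell sketch produces input in the DOT graph language from a list of edges. It is not deet's 60-line script: emit_dot is a hypothetical helper, and the "node successor" input format is an assumption. In deet the pairs would be gathered by following pointers in the target with the debugger commands.

```shell
# Hypothetical sketch: emit_dot turns "node successor" pairs (one per line
# on standard input) into a DOT digraph. Here the pairs are supplied by
# hand rather than read from a target program.
emit_dot() {
    printf 'digraph G {\n'
    while read -r from to; do
        # Skip null links so that the graph contains only real edges.
        if [ -n "$from" ] && [ -n "$to" ] && [ "$to" != "0x0" ]; then
            printf '    "%s" -> "%s";\n' "$from" "$to"
        fi
    done
    printf '}\n'
}
```

Feeding the lines "root left", "root right", and "left 0x0" to emit_dot yields a two-edge graph description that a graph-drawing tool can then lay out.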


5 Implementation 


deet is written in tksh; each deet command is im- 
plemented as one or more tksh functions that call the 
built-in Tcl nub commands. An example is the b function 
shown in Figure 8, which uses deet_breakpoint to 
set breakpoints. The size of this function is as important 
as its details: most debugging features are easily imple- 
mented in tens of lines of tksh code. 

b begins by converting its first argument into file, line
number, and character number values. When an incomplete
breakpoint is specified, some of these values will
be converted to null values. For example, b 8 causes the
cvtbp function to make the argument 8 the value
of line and to leave file and char null. Next, b
invokes deet_breakpoint -list to list all breakpoints
matching the incomplete breakpoint. If there is
more than one match, a list of possible breakpoints is
displayed and no breakpoints are set. If there are no
matches, a diagnostic is issued. Finally, if there is exactly
one match, that breakpoint is set. The associative
array breakpoint keeps track of the set breakpoints.
The nub doesn’t keep track of breakpoints because it is
designed to do as little as possible. If the second argument
specifies a condition for the breakpoint, it’s stored
as the value for the breakpoint array entry. Finally,
the source window (if it exists) is updated to highlight
the set breakpoint.
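
The conversion step can be sketched as follows. The paper does not show cvtbp, so parse_bp below is a hypothetical stand-in, and the "file:line.char" argument syntax is an assumption inferred from the keys of the breakpoint array in Figure 8. The file name is printed in quotes so that an empty name survives the eval set -- idiom used by b.

```shell
# parse_bp SPEC: split "file:line.char" into three fields, using '' for a
# missing file and 0 for a missing line or character position.
# Hypothetical stand-in for cvtbp; the spec syntax is an assumption.
parse_bp() {
    spec=$1 file='' line=0 char=0
    case $spec in
    *:*) file=${spec%%:*}; spec=${spec#*:} ;;   # peel off the file name
    esac
    case $spec in
    *.*) char=${spec#*.}; spec=${spec%%.*} ;;   # peel off the char position
    esac
    if [ -n "$spec" ]; then line=$spec; fi
    printf "'%s' %s %s\n" "$file" "$line" "$char"
}
```

For instance, parse_bp 8 prints '' 8 0, the incomplete breakpoint discussed above. The sketch treats a bare argument as a line number, so a file name given without a colon would be misread.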

The nub interface can set and remove breakpoints, but 
it cannot single-step the target [6]. deet’s step func- 
tion implements single-stepping by setting and removing 
breakpoints: 


function step {
    if [[ $cdbMode != "step" ]]; then
        deet_breakpoint -set "" 0 0
    fi
    cdbMode=step
    cdbgo    # resume execution
}


Calling deet_breakpoint with null values for the 
file, line number, and character number sets every break- 
point. Implemented naively, setting every breakpoint is 
expensive in large programs. But the nub could recog- 
nize this special case and use a more efficient implemen- 
tation. As described in Section 6.2, our implementation 
of the Tcl nub functions on top of gdb exploits this pos- 
sibility. 
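
The mode switching behind this scheme can be exercised outside deet with stubbed nub commands. In the sketch below the stubs, the NUB_LOG variable, and the cont function are assumptions made for illustration; re-setting the user's own breakpoints after leaving stepping mode is omitted.

```shell
# Sketch with stubbed nub commands so that it runs outside deet; the stubs
# and the NUB_LOG variable are assumptions made for this example.
NUB_LOG=''
deet_breakpoint() { NUB_LOG="$NUB_LOG[breakpoint $1]"; }
deet_continue()   { NUB_LOG="$NUB_LOG[continue]"; }

mode=cont

# step: plant every breakpoint once, then resume; the target stops at the
# next stopping point it reaches.
step() {
    if [ "$mode" != step ]; then
        deet_breakpoint -set "" 0 0
        mode=step
    fi
    deet_continue
}

# cont: leave stepping mode by removing the blanket breakpoints first.
cont() {
    if [ "$mode" = step ]; then
        deet_breakpoint -delete "" 0 0
        mode=cont
    fi
    deet_continue
}
```

Two consecutive calls to step issue the expensive set-everything request only once, which is why the cost is paid on entering stepping mode rather than on every step.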


6 Replacing the Nub 




An important aspect of deet’s “piece-parts” design is 
that superior replacements could be used for each part 
without disturbing the others. For example, a more ef- 
ficient, machine-specific nub could be used in place of 
cdb’s machine-independent nub; or a better or more fa- 
miliar user interface could be used. 

To demonstrate this flexibility, we’ve implemented 
three alternative versions of deet’s pieces: a version of 
the nub for Java [1], a nub that works by communicat- 
ing with gdb, and a replacement for the user-interface 
component that emulates gdb’s command-line inter- 
face. These limited experiments also reveal strengths and 
weaknesses in the nub-based design. If gdb cannot em- 
ulate the nub, for example, then a simple nub offers fa- 
cilities beyond those of some popular debuggers. If the 
nub cannot support gdb, then the nub is missing some 
important facilities. 


Figure 7: Tree generated with deet and dotty


function b {    # breakpoint [action]
    integer char=0 line=0
    typeset file point="$1" action="${2-':'}" msg
    eval set -- $(cvtbp "$point")
    file="$1" line="$2" char="$3"
    typeset bp="$(deet_breakpoint -list "$file" "$line" "$char")"
    eval set -- $bp
    if (( $# > 1 )); then
        msg="Pick one of $bp"
    elif (( $# < 1 )); then
        msg="No breakpoint in on line $line char $char"
    else
        deet_breakpoint -set "$file" "$line" "$char"
        set -- $1
        breakpoint["$1:$2.$3"]="$action"
        [[ $CdbWindow ]] && TextDispBpOn $2
    fi
    [[ $msg ]] && print -- "$msg"
}

Figure 8: Implementation of the b command


6.1 A Nub for Java


The Java Developer’s Kit contains a debugging package
(a set of classes) that can be used to explore and control
the state of a target Java program. These classes
are designed to support a variety of Java debuggers.
Java comes with jdb, a simple command-line debugger,
that is implemented with the debugging package,
and this package is intended to be used to write more
sophisticated graphical debuggers. The package works by
spawning an instance of the Java runtime with the target
and communicating with it via message passing.

Although the Java nub itself could spawn the runtime 
and the target, it’s simpler to use the debugging package. 
Implementing a nub for Java required writing the nub 
interface in terms of Java's debugging methods, which 
takes only a couple hundred lines of Java. The routines in
the nub read messages from a socket, process them using
the methods in the Java debugger package, and write the 
result messages back to the socket. A central method 
reads messages, decodes them, and calls the appropriate 
methods for each message. 
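
The read, decode, and dispatch structure of such a nub is simple enough to sketch in a few lines of portable shell, with standard input and output standing in for the socket. This is not the Java nub's protocol: the message names, the reply formats, and the serve function are invented for illustration.

```shell
# Hypothetical sketch of a nub's message loop: one request per line on
# standard input, one reply per line on standard output. The message
# names and reply formats are invented for illustration.
current_frame=0

serve() {
    while read -r cmd arg; do
        case $cmd in
        frame)
            # With an argument, select a frame; without one, report it.
            if [ -n "$arg" ]; then
                current_frame=$arg
                echo ok
            else
                echo "frame $current_frame"
            fi ;;
        continue) echo resumed ;;
        quit)     break ;;
        *)        echo "error: unknown message $cmd" ;;
        esac
    done
}
```

Each case arm plays the role of one method such as frameCmd; adding a debugging operation means adding an arm and the routine it calls.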

Figure 9 shows the method frameCmd, which per- 
forms the same task as deet_frame. When called with 
an argument, frameCmd sets the current frame to the 
number specified by the argument by finding the current 
frame with the getCurrentFrameIndex method 
and calling the up and down methods as needed. If there 
is no argument, frameCmd returns information about 
the current frame. This information is returned by call- 
ing methods in the debugging classes RemoteThread, 
RemoteStackFrame, and RemoteClass. 

Thus, with a couple hundred lines of Java code, deet 
can be used to debug Java programs with the same set of 
features that are used to debug C. Unfortunately, the nub 
interface does not currently support threads, which limits 
the usefulness of the Java debugger. 

A similar approach can probably be used on any sys- 
tem that has a debugging interface. For example, deet 
can be ported to Windows by implementing a nub in 
terms of Microsoft’s debugging API. 


6.2 Using gdb as a Nub


We’ve used gdb to build a variant of deet that uses 
gdb as the nub, and a variant that uses gdb as the user 
interface. 

Figure 10 shows how gdb replaces the nub. gdbnub 
communicates with gdb, which runs as a separate pro- 
cess. gdbnub translates nub function calls into gdb 
commands, sends these commands to gdb, and parses 
gdb’s responses. Another approach would have been to 
modify gdb's code, but past experience shows that mod- 
ifying gdb is a painstaking process [4]. Using tksh 
makes the approach illustrated in Figure 10 much sim- 
pler: the implementation takes only about 500 lines. 
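
The translation step can be pictured with the following portable-shell sketch. The gdb_command helper and its operation names are hypothetical; only the gdb command lines it prints (break, clear, frame, continue) are standard gdb commands. Sending the text to a gdb process and parsing its reply, which gdbnub also does, is not shown.

```shell
# gdb_command OP ARGS...: print the gdb command line for a nub request.
# Hypothetical helper; gdb addresses breakpoints by file and line, so a
# character position has no counterpart here.
gdb_command() {
    case $1 in
    set)      printf 'break %s:%s\n' "$2" "$3" ;;
    delete)   printf 'clear %s:%s\n' "$2" "$3" ;;
    frame)    printf 'frame %s\n' "$2" ;;
    continue) printf 'continue\n' ;;
    *)        return 1 ;;    # request with no gdb equivalent
    esac
}
```

A request such as gdb_command set lookup.c 14 yields the text break lookup.c:14, which a function like sendCommand in Figure 11 could then pass to gdb.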

The only nub feature that was not possible to implement
with gdb was setting breakpoints on any expression.
Some nub functions were relatively easy to implement,
but not efficiently. For example, gdb doesn’t provide
a way to list all of the possible breakpoints in a file.
This was implemented by attempting to set a breakpoint
at every line in each file, and checking which breakpoints
were successfully set. Fortunately, listing all of the possible
breakpoints is rarely done; listing the breakpoints
in a specific line is the common usage.

Figure 10: Emulating the nub with gdb

As described at the end of Section 5, deet steps 
through a program by issuing the nub command to set 
every breakpoint in the target, which is very inefficient. 
gdbnub takes advantage of gdb’s single-stepping feature
by using the deet_breakpoint implementation
shown in Figure 11. When null values are passed to the
-set option, a variable is set that puts gdbnub into
“stepping” mode when execution is resumed. Similarly, 
when null values are supplied with the -delete option, 
gdbnub reverts to “continuation” mode, and all tempo- 
rary breakpoints are removed. 

When gdb is used as the nub, deet can be thought 
of as a graphical front end to gdb. It provides facili- 
ties similar to ddd’s, which is also a front end for gdb. 
However, ddd is not as programmable.


6.3 A Nub for gdb 


Implementing gdb on top of the nub is difficult, because
gdb is a huge program and has a large number
of features. Some of the features in gdb are inherently
absent in the nub. For instance, gdb allows the target
to be examined at the machine level; gdb can examine
registers and single-step instructions. The nub interface
is machine-independent, so it cannot provide these
machine-dependent features. Similar caveats apply to
data watchpoints, which gdb supports on machines with
the appropriate hardware. gdb also supports features
that are irrelevant to debugging, per se, such as controlling
terminal modes and displaying gdb-specific online
help.


public void frameCmd(RemoteThread t, String args[]) throws Exception {
    if (args.length == 2) {
        int oldFrame = t.getCurrentFrameIndex();
        int newFrame = Integer.parseInt(args[1]);
        try {
            if (oldFrame < newFrame)
                t.up(newFrame-oldFrame);
            else if (oldFrame > newFrame)
                t.down(oldFrame-newFrame);
        } catch (ArrayIndexOutOfBoundsException e) {
            outputError();
        }
    } else {
        RemoteStackFrame s = t.getCurrentFrame();
        RemoteClass c = s.getRemoteClass();
        outputItem(t.getCurrentFrameIndex());
        outputItem(c.getName() + "." + s.getMethodName());
        outputItem(c.getSourceFileName());
        outputItem(s.getLineNumber());
        outputItem("0");    /* No support for char position */
    }
}

Figure 9: Implementation of deet_frame for Java


function deet_breakpoint {
    typeset i action=list
    case $1 in
    -l*)    shift ;;
    -s*)    action=set ; shift
            if [[ $1 = "" && $2 = "0" && $3 = "0" ]]; then
                sendCommand "b main"
                CONT_CMD="step"
                return 0
            fi ;;
    -d*)    action=delete ; shift
            if [[ $1 = "" && $2 = "0" && $3 = "0" ]]; then
                sendCommand "delete"
                for i in "${!GdbBreakpoint[@]}"
                do  unset GdbBreakpoint[$i]
                done
                CONT_CMD="cont"
                return 0
            fi ;;
    esac

Figure 11: Implementing single-stepping


Figure 12: Running gdb with the nub

Our third experiment thus focuses on only those gdb 
features used by a graphical front end, like ddd. Fig- 
ure 12 shows the organization of this experiment. 

The implementation of gdb using the nub was written 
with deet commands. gdb’s frame command illus- 
trates the general approach. frame controls the current 
frame in the stack of the target. Invoking frame with 
an argument directs gdb’s attention to a specific frame. 
If no argument is specified, the current function with its 
arguments and location is displayed. For example: 


(gdb) frame 
#0 lookup (word=0x11fffff8e0 "a",
p=0x140000010) at test/lookup.c:15 


Figure 13 shows the tksh implementation of frame. 
The function uses deet_frame to move the nub’s at- 
tention to a new frame. It also uses deet_sym and 
deet_getval to fetch and display the frame’s parame- 
ters and their values. 

This implementation of gdb, although incomplete, is 
only around 1,000 lines of ksh. It implements enough 
features to support ddd. gdb features that were not im- 
plemented include: 


• Debugging a target that is already running, which
  gdb can do on machines where this is possible.


• Invoking target functions from the debugger; the
  nub doesn’t support this feature, because a separate
  evaluation facility can support it [13].


• Examining core dumps; this feature could be supported
  by writing a nub specifically for browsing
  core dumps.


• Interrupting a running target.


• Handling signals.


7 Discussion 


deet’s front end runs on any machine on which tksh 
runs, which currently includes virtually all UNIX vari- 
ants, Windows NT and Windows 95. Graphical debug- 
gers that work consistently under both UNIX and Win- 
dows are scarce, and having a uniform interface can be 
important. Programmers writing code for multiple plat- 
forms can debug applications without having to learn 
multiple environments. Another advantage of a uniform 
interface is that one set of debugging scripts is often suf- 
ficient for all platforms. 

The nub hides most of the difficult portability issues.
deet is available on all of lcc’s platforms, because
its nub interface is machine-independent. deet
is also available on platforms that support gdb, because 
it can use the nub that runs on top of gdb. deet can 
be made available on other platforms by writing a new, 
platform-specific nub. Typical nub implementations take 
less than a thousand lines of code, so they aren’t triv- 
ial, but the effort required is tiny compared to porting a 
machine-specific debugger. 

deet demonstrates that it is possible to build a us- 
able debugger with a graphical user interface from sim- 
pler components, and, as the dotty example illustrates, 
that the result is more than just the sum of the parts. 
deet also confirms cdb’s premise that most of a de- 
bugger is machine-independent, and that the fundamen- 
tal machine-specific debugging facilities can be encapsu- 
lated in a small, machine-independent nub interface. 

deet doesn’t have all of the features offered by PC 
debuggers and UNIX debuggers like gdb. But it does 
provide the most important ones—at a fraction of the 
implementation cost. deet is about 1,500 lines of tksh 
code, and the machine-independent nub and related compiler
support (in lcc) total around 800 lines of C. These
2,300 lines of code are orders of magnitude smaller than 
gdb's 150,000 lines and ddd's 90,000 lines. 

Programmers interact with most debuggers in the
target’s source language plus a few debugger-specific
commands. For example, programmers throw C expressions
at gdb to browse the state of a buggy target. The
advantage of this approach is that programmers don’t have
to learn another language to use the debugger. But, as
Acid [15] and Duel [4] demonstrate, exploring a
program’s state is fundamentally different from writing the
program in the first place, and this exploration can be
done much more effectively in a higher-level language.
deet also supports this view; Tcl and tksh seem to
be better languages for writing debugging code than
languages like C and C++. Similar comments may apply to
other high-level scripting languages, like Perl.


function frame { # [num]
    [[ $1 != "" ]] && deet_frame $1 2> /dev/null
    set -- $(deet_frame)
    typeset num=$1 name=$2 file=$3 line=$4 char=$5
    typeset params="$(deet_sym -params)" p result
    result="#$num $name("
    eval set -A parm $params
    for p in "${parm[@]}"; do
        set -- $p
        result="$result${1##*:}=$(deet_getval "$2"), "
    done
    print -- "${result%', '}) at $file:$line"
}


Figure 13: ksh implementation of gdb's frame command

Abstract 


Cget, cput, and stage are three simple programs that implement authenticated or encrypted file 
transfers on the Internet. Cget and cput read and write files to a remote host, and stage ensures 
that a remote directory accurately mirrors a local master directory. 


These routines use secret key cryptography for authentication and privacy between pairs of
secured hosts. They are simple, paranoid Unix tools that can be used to support systems that
operate in a hostile environment.


1 Introduction 


A host can be made reasonably resistant to compro- 
mise from the Internet if it isn’t running any danger- 
ous network services. Most hosts don’t come from 
the manufacturer configured in this manner—they 
have to be stripped of all of their network services by 
hand. Then only the desired services are installed. 
If the network services are secure (a big “if”), then 
machines are much harder to breach.

Two such secure hosts can exchange files and re- 
main reasonably secure if there is a safe file trans- 
port service available. This file transfer service 
would have to be resistant to all the popular attacks 
found on the Internet today, including the recent IP 
spoofing[12] and TCP hijacking[10]. If the data is 
sensitive, then the service would also have to ensure 
privacy. 

Such a service is needed often these days. In par- 
ticular, publicly-available Internet services like FTP 
and HTTP are often provided by hosts running on the
dirty side of the firewall, usually in a DMZ. How can 
we administer such hosts, perhaps from the relative 
safety of another host behind a firewall? How do we 
install new programs, or install new content? 

The standard out-of-the-box network services do 
not provide this security. In fact, they have had a 
history of jeopardizing their servers. FTP has been 
a constant source of trouble: its passwords are easy 
to sniff, its protocol has various flaws, and various 
servers have had security holes [5, CA-88:01, CA- 
92:09, CA-93:06, CA-94:07, CA-94:08, CA-95:16]. 
Rcp relies on address-based authentication and the
integrity of the Domain Name System, which is easy
to fool [4, 16]. It is also susceptible to IP spoofing
attacks [14]. NFS has a variety of weaknesses: it can 
be fooled with address spoofing, root handles can be 
sniffed or guessed, and it relies on RPC services that 
have security weaknesses of their own. 

These services are frequent targets for successful 
hackers. Still, they are often employed because they 
are widely available, and most developers are famil- 
iar with them—even when they are clearly unsuited 
for the job. Developers don’t have time to build 
new tools: the frenzy of Internet hype and growth 
can leave management focused on time-to-market is- 
sues. Security is often left to the last minute, and 
patched in after the design is finished. 

Marcus Ranum and I faced these problems in the 
fall of 1995. We encountered ad hoc solutions using 
standard, dangerous Internet services. For example, 
billing data was transferred in the clear using FTP. 
Developers cast about for ways to transfer configu- 
ration files and other important data. The standard 
tools they used were jeopardizing some very impor- 
tant hosts. 

We wanted a very simple solution, in the tra- 
dition of small Unix tools. This is not a very tall 
order: it requires a simple file transfer program run- 
ning some strong cryptographic or authentication 
routines, and a shared secret key. 

We were only setting up a few point-to-point 
links, so it was easy to distribute a secret key to 
each end of a connection. We didn’t want an au- 
thentication server or need public key cryptography. 
We envisioned that a pair of simple programs, plus 
a key file and a configuration file, were all that were
needed. A simple implementation meant that more 
people were likely to understand and use the soft- 
ware, even if they were in a hurry to make a dead- 
line. 

There were a variety of possible off-the-shelf so- 
lutions available at the time: our problems were not 
new. Most of them appeared to have adequate secu- 
rity, but none were as simple as we desired. (Some 
of these are discussed in section 9.) 

We have a better chance of avoiding security bugs 
if the programs are small and simple. 

Marcus and I built three programs. Each was 
as simple as possible, and written with all the mini- 
malism and paranoia we could muster. Cget reads a 
single file from a remote server, and cput writes a file 
back. Cget and cput are quite primitive: they do not 
support file deletion, directory creation, or program 
execution. (They were originally named get and put, 
but that clashed with SCCS’s routines of the same 
name.) Stage mirrors a master directory to a slave 
host. Files and directories are created, deleted, and 
downloaded as needed. It is ideal for updating an 
external web server or FTP archive from an internal 
staging host. Unstage, a recent addition, works in 
reverse, updating a local copy of a remote master 
directory. It can be used to suck logs from a server, 
or perhaps get a distribution from a master tree. 

Our initial cryptographic routines used DES en- 
cryption. For many applications, this is overkill. 
Most transfers are not secret: FTP and Web data
are generally intended for public viewing. 

If we kept strong encryption, it would make the 
software release difficult and unlikely. 

So the crypto layer has been rewritten—the in- 
terface and code are cleaner than the first version. 
The new crypto routines use only HMAC keyed mes- 
sage digests to protect the conversation. If my au- 
thentication protocol is OK (always a big ‘if’), an 
eavesdropper may watch or interrupt a session, but 
cannot modify or replay a session without detection.

The next section describes the authentication 
protocol and some cryptographic issues. Section 3 
describes the user interface to the cryptographic pro- 
tocols. Server design issues are explored in section 
4. The stage service is discussed in some detail in 
Section 5. Section 6 has a couple of applications 
for these programs, including the confinement of an 
arbitrary TCP service. Section 7 covers vulnerabili- 
ties, and section 8 has some performance figures. A 
tiny sample of the related work is discussed in sec- 
tion 9. Section 10 describes some enhancements and 
limitations to these routines, and availability infor- 
mation is in section 11. Appendix A describes the 
staging protocol.


2 The Authentication Protocol 


The client and server exchange messages in SSL for- 
mat, although we do not use SSL’s complex key 
setup. Each message contains a two byte length 
field, the payload, and a 16 byte binary digest. 

I use the HMAC [3] digest with MD5, which appears
to be headed for general usage on the Internet.
An HMAC digest of a message $M$ using key $k$
is shown as


    $[M]_k$.


The client and server share a secret key, $K_{ss}$.
The protocol uses challenges in both directions to
derive session keys. The challenges are sixteen random
bytes encoded in pairs of hex digits. The session
key for each end is derived from $K_{ss}$ and the
challenge from the other end:


    $K_s = [C_c]_{K_{ss}}$
    $K_c = [C_s]_{K_{ss}}$


The server writes with $K_s$, the client with $K_c$. The
initial exchange between client $C$ and server $S$
proceeds as follows:


    Message 1   $C \to S$:   $N, C_c, [N, C_c, S_c]_0$
    Message 2   $S \to C$:   $C_s, [C_s, S_s]_{K_s}$
    Message 3   $C \to S$:   “OK”, $[\text{“OK”}, S_c]_{K_c}$


Here $N$ is a service name (see section 4.1), $S_c$
and $S_s$ are sequence numbers, $C_c$ is the client’s
challenge and $C_s$ is the server’s challenge. The sequence
numbers are four bytes long. $S_c$ starts at zero, and
$S_s$ starts at $2^{31}$.

Message 1 delivers the client’s challenge. It uses 
the key 0, which helps us detect casual probes of 
the service. Message 2 proves to the client that the 
server is using the new challenge, has the secret key, 
and provides the server’s challenge. Message 3 has 
a trivial payload, but proves to the server that the 
client is using the fresh challenge and the secret key. 

The session keys are used to prevent replay at- 
tacks using messages from previous sessions. If we 
simply keyed our digests with the shared secret key $K_{ss}$, an attacker
could replay a previous session, perhaps replacing a 
new file with some older one. Each end uses a differ- 
ent session key so a message can’t be played back to 
its originator. Similarly, the sequence numbers pre- 
vent replays of earlier messages in the same session. 
Without these, an attacker might hijack the TCP 
session, and replay earlier messages. The sequence 
numbers differentiate the hashes from each end of 
the conversation. 
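
The exchange above can be simulated in a few lines. This is a minimal sketch, not the paper's code: it assumes that the digested fields are simply concatenated, that sequence numbers are big-endian and advance by one per message sent, and that "the key 0" is a single zero byte. The names digest, seq, new_challenge, and handshake are hypothetical.

```python
import hashlib
import hmac
import secrets
import struct

ZERO_KEY = b"\x00"  # assumption: one plausible encoding of "the key 0"


def digest(key, *parts):
    """HMAC-MD5 over the concatenated parts; returns 16 bytes."""
    return hmac.new(key, b"".join(parts), hashlib.md5).digest()


def seq(n):
    """Four-byte sequence number (big-endian is an assumption)."""
    return struct.pack(">I", n)


def new_challenge():
    """Sixteen random bytes encoded as pairs of hex digits."""
    return secrets.token_hex(16).encode("ascii")


def handshake(client_key, server_key, service=b"demo"):
    """Simulate messages 1-3 in one process; True if both ends accept."""
    s_c, s_s = 0, 2 ** 31

    # Message 1, C -> S: service name, client challenge, digest keyed with 0.
    # Anyone can compute d1; it only filters out casual probes.
    c_c = new_challenge()
    d1 = digest(ZERO_KEY, service, c_c, seq(s_c))
    s_c += 1

    # Message 2, S -> C: server challenge, digest keyed with the server's
    # session key, which is derived from the client's challenge.
    c_s = new_challenge()
    d2 = digest(digest(server_key, c_c), c_s, seq(s_s))
    expected2 = digest(digest(client_key, c_c), c_s, seq(s_s))
    if not hmac.compare_digest(d2, expected2):
        return False  # client rejects: the server lacks the shared key
    s_s += 1

    # Message 3, C -> S: "OK", digest keyed with the client's session key,
    # which is derived from the server's challenge.
    d3 = digest(digest(client_key, c_s), b"OK", seq(s_c))
    expected3 = digest(digest(server_key, c_s), b"OK", seq(s_c))
    return hmac.compare_digest(d3, expected3)
```

With matching keys the simulated exchange completes; with different keys it stops at message 2, because the client cannot reproduce the server's digest.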
Although the client can force the server to use
a specific challenge, and therefore the same key, it 
can’t finish the protocol initialization without using 
the server’s fresh challenge and the secret key. 

A man-in-the-middle can’t change a message 
without detection: he cannot determine the session 
key without the secret key, and he cannot obtain the 
right answers from an additional connection to the 
server, since the session key will be different. 

This is a simple protocol, and it looks like it 
ought to do the job, but I am not a cryptographer, 
and history teaches that it is hard to get crypto- 
graphic protocols right. Though I don’t see how this 
protocol can be abused, I’d feel better if each session 
key were based on both challenges. 


2.1 Administrative concerns 


Protocol setup can interact with administrative con- 
cerns. Our original protocol was terse and unhelpful 
when, say, one end had the wrong key. The obscu- 
rity may have added some security, but it sure didn’t 
help our users who were trying to set up the service. 

The setup may fail for a number of reasons: the 
key is wrong or missing, the key file is not readable, 
the service is non-existent, etc. Each of these errors 
means that the server cannot return the correct di- 
gest in its first message. If the digest is wrong during 
protocol setup, the client checks the payload for the 
string “remote reported an error:”. If present, 
the rest of the message is an error message from the 
server describing the problem. 

Until the exchange is complete, the protocol is 
subject to attack. In particular, it is possible for an 
attacker to inject a false error message during this 
setup phase. 
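
A minimal sketch of the client-side check described above. It assumes the marker string appears at the start of the payload; the function name and the wording of the returned messages are hypothetical. Because nothing authenticates the text at this stage, the sketch labels it as unverified.

```python
ERROR_PREFIX = b"remote reported an error:"


def explain_setup_failure(payload, digest_ok):
    """Return None if the digest verified, else a message for the user."""
    if digest_ok:
        return None
    if payload.startswith(ERROR_PREFIX):
        text = payload[len(ERROR_PREFIX):].decode("ascii", "replace").strip()
        # Nothing authenticates this text, so label it as unverified.
        return "server says (unverified): " + text
    return "bad digest during setup"
```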


2.2 Keys 


Secret keys are appropriate here: we don’t need pub- 
lic key cryptography. The usual complaint about 
secret keys is the distribution problem—how do we 
move them around securely? For us, these programs 
are only employed in a handful of hosts, involving 
perhaps a dozen services and their keys. 

Our keys are printed in hex bytes or base 64 
encoding, both human-readable. They can be dis- 
tributed by hand or over the phone when the service 
is installed. We type ours in at the consoles of the 
hosts involved. 

This might not scale to a large setup, but one 
could imagine an ISP allowing a thousand customers 
to use stage to update a thousand separate web di- 
rectories. It would not be much harder to distribute 


a binary key than a password that a user has to re- 
member, and the client wouldn’t need an account on 
the web server. (This is always a good feature: users 
are annoying, and tend to disrupt security arrange- 
ments.) 

The secret keys are generated by a program 
named makekey. Its keys, and the protocol’s session
challenges, are generated with truerand [11].
(Truerand runs a counter in a tight CPU loop while
waiting for an alarm timeout some milliseconds later. 
The bottom two or three bits of the counter are con- 
sidered random.) 

We use full random binary keys: there are no 
passwords that a user must remember. 
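
A minimal sketch of generating a full random binary key and printing it in the two human-readable encodings mentioned above. It is not makekey: it draws from the operating system's random generator rather than truerand, and the 16-byte key length is an assumption, since the text does not state one. All function names are hypothetical.

```python
import base64
import secrets

KEY_BYTES = 16  # assumption: the paper does not state the key length


def make_key(nbytes=KEY_BYTES):
    """Random binary key from the operating system's generator (not truerand)."""
    return secrets.token_bytes(nbytes)


def to_hex(key):
    """Hex bytes separated by spaces, easy to read aloud or type at a console."""
    return " ".join(f"{b:02x}" for b in key)


def from_hex(text):
    """Parse hex bytes; spaces between bytes are ignored."""
    return bytes.fromhex(text)


def to_base64(key):
    return base64.b64encode(key).decode("ascii")


def from_base64(text):
    return base64.b64decode(text.strip(), validate=True)
```

Either encoding round-trips to the same bytes, so a key read over the phone in hex can be checked against a base 64 copy on the other host.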


3 Interface Routines 


To use the crypto routines, the client uses the fol- 
lowing code: 


fd = tcpconnect(host, port);
ep = start_client_crypto(fd, key,
        srv_nam);
if (ep != 0) {
        /* error */
}
n = cread(fd, buf, sizeof(buf));

n = cwrite(fd, &gt, ngt);
if (n < 0) {
        cperror("writing gunilla table");
        exit(1);
}


and the server uses 


fd = bindto(service_port);
ep = start_server_crypto(fd);
if (ep) {
        perror(ep);
        exit(1);
}
n = cwrite(fd, "hello", 6);
n = cread(fd, buf, sizeof(buf));


(Some error processing is simplified for clarity.) 
cread and cwrite are analogous to the standard I/O
routines. Cperror reports standard or cryptographic
errors. This protocol preserves message delimiters:
each cread will return only the bytes sent by the
corresponding cwrite. The maximum message size is
$2^{16} - 1$ bytes.

The server must supply a routine named setser- 
vice, which obtains the secret key or returns an error 
message. 
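
The message layout from section 2 (a two-byte length, the payload, and a 16-byte digest) explains why delimiters are preserved: the reader always knows where one message ends. The sketch below illustrates that framing under stated assumptions: the length field is big-endian and counts only the payload, and the digest covers the payload followed by a four-byte sequence number. The functions pack_message and unpack_message are hypothetical stand-ins, not the paper's cwrite and cread.

```python
import hashlib
import hmac
import struct

DIGEST_LEN = 16            # size of an HMAC-MD5 digest
MAX_PAYLOAD = 2 ** 16 - 1  # the most a two-byte length field can describe


def pack_message(key, seqno, payload):
    """Length field, payload, then a digest over payload and sequence number."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload too long for a two-byte length field")
    mac = hmac.new(key, payload + struct.pack(">I", seqno), hashlib.md5).digest()
    return struct.pack(">H", len(payload)) + payload + mac


def unpack_message(key, seqno, data):
    """Return (payload, remaining bytes); raise ValueError on a bad digest."""
    if len(data) < 2:
        raise ValueError("truncated message")
    (length,) = struct.unpack(">H", data[:2])
    end = 2 + length + DIGEST_LEN
    if len(data) < end:
        raise ValueError("truncated message")
    payload, mac = data[2:2 + length], data[2 + length:end]
    expected = hmac.new(key, payload + struct.pack(">I", seqno), hashlib.md5).digest()
    if not hmac.compare_digest(mac, expected):
        raise ValueError("digest mismatch")
    return payload, data[end:]
```

Two messages written back to back come out as two separate payloads, and a message presented with the wrong sequence number fails the digest check.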
4 Services, servers, and server trust 


Our server programs assume that they have no 
friends. For example, they shed privilege early to 
minimize the code we must trust. The server programs
obtain the key for the calling host, chroot to a
target directory, and change their user id and group 
to some less-privileged account. All inputs from the 
client are carefully checked for pathological values. 

We have two servers: getd for cget and cput, and 
staged for stage and unstage. They operate on dif- 
ferent TCP ports, and are called by inetd. 


4.1 Service names 


Originally, these routines were keyed to a host’s 
numeric IP address. A single client host could 
cget/cput to one area of a server, and stage to an- 
other. If others needed to stage to the same server, 
they would have to connect from a different client 
host. The IP address of the caller was used to se- 
lect the proper key and service configuration on the 
server. Though it provided only slight security, the 
connection had to originate from that IP address. 
The secret key provides the real security. 

This approach worked well for simple setups, but 
the one-service-per-client limitation became incon- 
venient as the services were used more. Also, a 
traveling host couldn’t access a fixed server, because 
the client’s IP address was unpredictable. We could 
provide additional services on different ports on the 
server, but this is awkward. 

The new authentication protocol includes a ser- 
vice name. The key and other information are based 
on this name. A single client host can have a number 
of services on a given server. Different users can have 
different access to the same server, all controlled at 
the client end by read access to the relevant key file. 
This trust model shouldn’t be pushed too far: Unix 
root accounts are generally not very resistant to user 
attack. A serving host should extend about the same 
level of trust to all services from a given client. 


4.2 Trusting the Server Software 


So far, I have assumed that 

1. we control the server machine entirely, and 

2. we trust the server code until it drops privileges.


Neither assumption may be true if we would like 
to persuade someone else (say, an ISP) to run our
servers on their host. They would be more willing 
to run our software if we don’t need root permission, 
and perhaps even more willing if they can contain
our software with their own chroot. 

The problem is that chroot requires root per- 
mission, and it is hard to change the user id safely 
with standard shell commands after the chroot. We 
have to include a setuid program within the software
“jail” to change from user root, and the user with 
access to this directory might find a way to disable 
this program. 

The chroot program needs an option to set the 
UID of the executing program. I wrote a trivial ver- 
sion of chroot named jail to do this. I can give this 
tiny program to the ISP. It’s only a few lines long: 
they can examine it, trust it, and confine our servers 
nicely with it. 

We have some problems when the server is en- 
closed in the jail. How does the program obtain its 
key, unless the key is stored within the jail itself? 
We may wish to keep the key secret from the user. 
If the key is stored in the jail, a remote user may 
have undesirable read access to it. It could be piped 
in through a file descriptor opened by jail, but that’s
a bit awkward to set up. The key could also be a 
parameter to the server program, but that can make 
it visible to other users on the serving host through 
the ps command. 

It would also be nice to let the jailed server issue 
syslog messages. This requires possibly-dangerous 
special files inside the jail, or some mechanism for 
jail to perform the openlog and pass the file descrip- 
tor to the server in the jail. 

I am not satisfied with the solutions to either of 
these problems. Chroot is a good start, but Unix 
lacks adequate confinement primitives. 


5 Stage 


Once the crypto routines were working for cget 
and cput, stage was an obvious application. Stage 
runs through a local master directory, comparing 
each file and directory with the contents of the 
remote slave directory. The master directory for
a service is identified by an entry for that service
in a configuration file on the client, usually in
/usr/local/etc/stage.conf.

Like cget/cput, stage uses the service name to 
look up the appropriate key. Staged also uses the 
service name to determine the target directory, user 
id, and file permission mask. It uses chroot to con- 
fine itself to the target directory. This is good: it 
is somewhat more complex than getd, and therefore 
more likely to have bugs. It does check its input 
from the client carefully: strings can’t be too long, 
“..” is not allowed in path specifications, etc. 
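
A minimal sketch of the kind of input check described above. The length limit is a hypothetical number, since the text gives none, and treating ".." as a forbidden path component (rather than a forbidden substring) is an assumption. The NUL check is an extra precaution not mentioned in the text.

```python
MAX_PATH_LEN = 1024  # hypothetical limit; the paper gives no number


def path_is_acceptable(path):
    """Reject empty, overlong, NUL-containing, or '..'-containing paths."""
    if not path or len(path) > MAX_PATH_LEN:
        return False
    if "\0" in path:
        return False
    # Assumption: ".." is refused as a whole path component.
    return ".." not in path.split("/")
```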
Stage will update a file if it has changed. A file is
considered changed if its modification date or length
is different, or if its MD5 checksum is different. The
checksum is time consuming—an option suppresses 
this check. When a file is copied over, its modifi- 
cation date is set to match the master copy, if the 
operating system on the server allows it. 
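
The change test can be sketched as follows: the cheap comparisons run first, and the checksum is computed only when they agree and the option has not suppressed it. This is an illustration under assumptions, not the stage source: timestamps are compared to whole seconds, and the slave's attributes are passed in as plain arguments. The names md5_of and has_changed are hypothetical.

```python
import hashlib
import os


def md5_of(path, chunk_size=65536):
    """MD5 of a file's contents, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def has_changed(master_path, slave_mtime, slave_size, slave_md5, use_checksum=True):
    """True if the master file differs from what the slave reports."""
    st = os.stat(master_path)
    if int(st.st_mtime) != int(slave_mtime) or st.st_size != slave_size:
        return True  # cheap checks first
    if use_checksum:
        return md5_of(master_path) != slave_md5  # the time-consuming check
    return False
```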

Ordinary users can use stage to update all or 
some portion of the master directory. It only takes a 
few seconds to check and transfer a few megabytes. 
Stage does not exit until the update is complete: 
there is no queuing mechanism involved. Should the 
program abort or fail for some reason, it can be rerun 
to ensure that the directories match. 

The user can stage a file or a directory. Either 
must appear under the master directory. If he stages 
a non-existent path, that path will be deleted on the 
server, if it exists. Hence: 


rm -rf foo 
stage remote foo 


will delete a directory or file named foo at the other 
end. There is a subtle distinction here: 


stage remote * 


and 
stage remote . 


are not the same. The first entry will update all ex- 
isting files and directories in the current directory. 
The second will do that, plus delete any remote en- 
tries that don’t appear locally. 
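
The difference comes from what the shell hands to stage. A toy model of the top-level names makes it concrete; plan_stage and its arguments are hypothetical, and the model ignores recursion into subdirectories.

```python
def plan_stage(args, local, remote):
    """Return (to_check, to_delete) for top-level names only.

    args   -- names as the shell hands them to stage ('*' already expanded)
    local  -- names present in the current master directory
    remote -- names present in the matching slave directory
    """
    local, remote = set(local), set(remote)
    if args == ["."]:
        # Staging the directory itself: the full listings are compared,
        # so remote-only entries are found and deleted.
        return sorted(local), sorted(remote - local)
    named = set(args)
    # Named paths: existing ones are checked, missing ones are deleted
    # remotely, and anything not named is never looked at.
    return sorted(named & local), sorted((named - local) & remote)
```

With local entries a and b and a stale remote entry old, the shell expands * to "a b", so old is never mentioned and survives; staging "." compares the listings and removes it.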

Stage makes no special provisions for special files 
or soft or hard links. It makes no special provisions 
for files or directories that change during the staging 
process. 

The user does not have to have read access to 
the key file: stage could be setuid to an account or 
group that has read permission for the key. 

Staged serves stage requests on the remote server.
It consults /usr/local/etc/staged.conf,
which has one line for each supported service name. 
Each line contains the service name, the slave direc- 
tory, and an optional UID and umask. File owner- 
ship is not propagated to the server: the slave files 
are owned by the account that the service is config- 
ured for. Staged logs all file activities via the syslog 
facility. 
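
A minimal sketch of reading such a file. The text gives the fields but not the syntax, so everything else here is an assumption: whitespace-separated fields, "#" comments, a numeric UID, and an octal umask. The function name and the sample entries are hypothetical.

```python
def parse_staged_conf(text):
    """Map service name -> directory, uid, umask (the last two may be None)."""
    services = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumed comment syntax
        if not line:
            continue
        fields = line.split()
        if not 2 <= len(fields) <= 4:
            raise ValueError("bad line: " + raw)
        services[fields[0]] = {
            "directory": fields[1],
            "uid": int(fields[2]) if len(fields) > 2 else None,
            "umask": int(fields[3], 8) if len(fields) > 3 else None,
        }
    return services
```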

The stage client uses a little protocol to control
the remote server. It is very simple (see Appendix 
A), a subset of basic file system access primitives. It 
could be optimized to improve performance. 


6 Applications 


These routines have been well-received, and are em- 
ployed in a number of places in AT&T and Lucent. 
In our lab, stage has been providing quick and sim- 
ple support for ftp.research.att.com for over a 
year, and now supports ftp.research.bell-labs.com
and Lucent’s Web service on www.lucent.com.
We had to make a proxy version of stage for this last 
service to support it through our older firewalls. It 
replaced a clunky FTP-based implementation that 
was unable to delete files on the slave server. Cput 
has been especially useful for moving new binaries 
to external hosts. 

One researcher has been transporting large, ex- 
tremely sensitive databases over networks having du- 
bious or unknown security. The encrypted version 
of cput is vital for him. 

We have also supported our various external 
dirty Unix hosts with these tools. Automated jobs
use cput to update mail alias files and
internally-generated name server configuration files. Cget
fetches daily log files so users can monitor access to 
the FTP directory without requiring logins or more 
invasive access to the server. Hal Purdy at AT&T 
Research has ported the encrypting staged to Win- 
dows 95 to support a roomful of PCs outside the 
firewall. 


6.1 An Example: updating mail alias files 


Traditional Unix filesystem permissions can be used 
to provide finer control over access to the target di- 
rectory. For example, we have a slave mail directory 
on an external server that contains programs and 
data files. The master host is allowed to overwrite 
and create the mail alias files, but is not allowed to 
change the executable files in the slave directory. We 
could just split them into separate directories, but 
we are used to the current configuration. 

The executable files are owned and writable by 
account bin. The directory and the alias files are 
owned by daemon. This lets the master host update 
and add new alias files, but it doesn’t have write per- 
mission for the executable files. If we supplied com- 
plete Unix file system semantics, they could delete 
the executables (since they have write permission on 
the directory) and create new ones. But cget won’t 
delete a file, only replace it. There’s one additional 
protection: we can have the umask in getd’s configuration
file clear the executable bits in downloaded
files. 
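
The arithmetic behind that last protection is short: every bit set in the umask is cleared from the mode requested at file creation. The sketch below uses a hypothetical umask value of 0111 to clear all three execute bits.

```python
def mode_after_umask(requested_mode, umask):
    """Permission bits that survive: every bit set in umask is cleared."""
    return requested_mode & ~umask & 0o777


NO_EXEC = 0o111  # execute bit for user, group, and other
```

For example, a file requested with mode 0755 under this umask is created as 0644, so a downloaded file cannot arrive executable.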

If we rely on UNIX’s built-in file permissions, 
we reduce the amount of special-case permission-checking
software in getd. Smaller programs are
safer. 


6.2 Bob’s TCP service 


Bob has a service that he would like to offer the 
world. I don’t know much about it. It looks at some 
files, answers queries on a TCP port, and writes logs. 
It doesn’t need to be root, or use any fancy system 
services or files. It’s just a program. 

I like Bob, and I trust him (mostly). His service 
is not only harmless, it appears to be quite worth- 
while. He’d like to run it on our external server. 

I’d like to help, but I don’t want to jeopardize a
host that we’ve taken great pains to secure. I don’t 
want an error on Bob’s part to reduce that host’s 
security. How far do we have to trust him? 

We can lock Bob’s software and files inside a ch- 
root environment. He won’t be running as root, and 
we won't leave any dangerous programs in his jail, 
so we are pretty confident that he can’t get access 
to the rest of our file system. We add the following 
single line to /etc/inetd.conf:


23950 stream tcp nowait root /sbin/chroot chroot /usr/bob /bin/su - bob -c /tree/bin/zsrv


We use chroot to confine Bob, then su to give up 
root privilege, before Bob ever gets to execute an in- 
struction. The security measures are right out there 
in the open, on the line where an auditor can see and 
understand them. Unfortunately, /bin/su is inside 
his jail, so we use jail instead: 


23950 stream tcp nowait root /sbin/jail jail -u 99 -g 2 /usr/bob /tree/bin/zsrv


Jail is just like the chroot command with the ad- 
ditional -u and -g options to set the user id and 
group. 
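
The order of operations is what makes such a wrapper safe, and a short sketch shows it. This is an illustration in Python, not the jail program itself; the function name is hypothetical, and dropping supplementary groups is an extra step the text does not mention. It must be started as root on a Unix host.

```python
import os


def jail(root_dir, uid, gid, argv):
    """chroot to root_dir, drop to uid/gid, then exec argv."""
    os.chroot(root_dir)      # requires root, so it must come first
    os.chdir("/")            # do not keep a working directory outside the jail
    os.setgroups([])         # extra step, not in the paper: drop other groups
    os.setgid(gid)           # group before user: after setuid it is too late
    os.setuid(uid)           # gives up root privilege for good
    os.execv(argv[0], argv)  # the path is resolved inside the jail
```

A call such as jail("/usr/bob", 99, 2, ["/tree/bin/zsrv"]) would correspond to the inetd entry above. Nothing from inside the jail runs before the privilege drop, so no setuid helper is needed there.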

His program, zsrv, should be linked with static
libraries: it greatly simplifies the setup of the jail. 

We split Bob’s directory (/usr/bob) into two 
subdirectories: one (/usr/bob/tree) that he can 
stage into, and one (/usr/bob/log) that he can cget
or unstage from. He can update his server software 
and other files under /tree at will. He can change 
the network server, but it’s always owned and exe- 
cuted on account bob, not root, so he is very unlikely 
to get out of his jail. 

What can Bob, or someone who has hacked into 
Bob’s account, do to us? Most of the problems are 
of the denial-of-service type: 




• File System Full. He can fill the partition that
  holds /usr/bob/tree. If this is a concern, we
  can put him in his own partition. Then it only
  breaks his program, though it may cause
  annoying log messages elsewhere.

• Core dumps. These can fall under the
  file-system-full problem. The chroot environment
  assures that the core dump will go in his
  directory, not somewhere else.

• CPU hog. If he uses too much of the CPU, we
  can nice him down before he starts.

• Memory Full. He could eat up or thrash memory.

• Open Network Connections. Our jail doesn’t
  stop Bob from opening outgoing network
  connections. This could be abused in a few ways.
  In particular, he could try to embarrass us or
  take advantage of the good name of our serving
  host. Since these can be spoofed anyway,
  it shouldn’t be an unusual problem.


Bob has been running his Z39.50 service for over 
a year. Aside from a few core dumps and some occa- 
sional configuration changes while companies lurch 
apart, Bob hasn’t been a problem. 


7 Vulnerabilities 


The administrator is often the source of security 
problems. It is easy to leave key files around with 
unintended read permissions or incorrect ownership. 

If the server is compromised, the game is lost: 
our original goal was to protect the server. 

If the client is compromised, the intruder can 
gain access to all the services allowed to that host. 
The server’s paranoia can block further spread of 
such an attack, limiting the intruder’s access to a 
particular directory on the server. This can be an ef- 
fective barrier which adds another layer to the depth 
of security. 

If either host is compromised, then so are the
keys, since they are stored in files. This alarms
some people, but the keys are only as valuable as
the host: once the host is compromised, the key is
useless anyway. We’ve attempted to limit the extent
of such a catastrophe by limiting the trust each
host has for the other.

The theft of a key can compromise the privacy of 
old sessions of the encrypting transport. We don’t 
change keys often. 

Session interruption can be a problem. If im- 
portant security files are transmitted, the user must 
note that an attacker can abort a transfer. For
example, tcpwrapper [15] has a list of permitted 
connections in /etc/hosts.allow and a stoplist in 
/etc/hosts.deny. If they are transmitted in that 
order, there is a time when the /etc/hosts.allow 
is in place without the exceptions. This time win- 
dow can be enlarged by interfering with the second 
transfer. An administrator must bear these concerns 
in mind when moving security-related files. In this 
case, the /etc/hosts.deny file should be transmit- 
ted first. 

As mentioned before, the protocol’s error mes- 
sages may be subject to social engineering during 
setup. It’s an unlikely attack since the timing is 
tight and the administrator is likely to be watching 
both the client and the server during initial config- 
uration anyway. 

As always, a network service accessible to the
public is open to denial-of-service attacks.
Even if the digest is wrong, enough incoming packets 
can swamp any service, making it unavailable to its 
intended users. 


8 Performance 


We have found that these routines work with ac- 
ceptable speed. With our first protocol using Eric 
Young’s DES encryption library we could transfer 
500 KB/s through the localhost port of an SGI run- 
ning with 150 MHz CPUs. 

In another test, our entire 700 megabyte FTP 
directory was copied between two fast SGI hosts 
through a firewall in about 47 minutes. When stage
was rerun just after this transfer, without any up- 
dates, the check took about 330 CPU seconds and a 
little over 32 minutes. With checksumming turned 
off, it took six minutes. 

We don’t expect that an entire large directory 
will be staged more than about once a day. Users 
find the instant-update feature handy, and tend to 
stage the little bits they change quite often. The 
protocol is not especially efficient: it could be en- 
hanced to speed up the file-checking phase. 


9 Related Work 


We have reinvented a well-rounded wheel. There 
are a number of software solutions available, some 
well-suited to their environments. We want a very 
light-weight solution. Private keys are fine, without 
authentication servers or fancy certificates. If the 
infrastructure is in place for such tools, use them: 
Rcp with Kerberos would be fine, if we were running
Kerberos in the first place. But most of our
customers don’t run Kerberos.

Ssh [17] was fairly new when we did this work. It
provides a transport protocol that will probably be 
quite secure—it is under repair at this writing. In 
late 1995, ssh was a beta release. But for our uses, 
ssh has too many features—it’s too large. It offers 
optional features we don’t want, like X11 transport 
and login facilities. We didn’t necessarily want these 
additional services on our secure hosts. The code
is more complicated, and there are more things to 
misconfigure. (We do use ssh in other places: it’s a 
nice package.) 

There are a number of mirroring and general 
software distribution packages available. The best 
known is rdist [9]. Rdist typically uses rcp or ssh for
transport. The former is not appropriate, but ssh
is a good choice for supporting rdist, and it has its
enthusiastic supporters. There’s a lot of mechanism 
there for our simpler applications. Rdist has earned 
three CERT advisories [6, 7, 8], which also makes us 
nervous. 

Two other transport programs include filetsf [13]
and Mirror [2], a Perl program. Mirror uses FTP
for transport, filetsf uses lpr/lpd. Again, we don’t
want these additional services on our safe machines. 

Other file transfer programs have appeared re- 
cently, such as SSLftp. See [18] for a wide assort- 
ment of related tools. 

We’ve seen other batch approaches to these prob- 
lems. A file can be signed or encrypted with PGP, 
and transported by FTP or even email. These 
batch processes are bulky and unsatisfying. They 
require action by special accounts, often initiated by 
a polling program run by cron. Stage provides im- 
mediate updates, initiated by the end user. 

Lower-level encryption, like IPsec and IPv6,
offers more general solutions. We could use various
existing tools if these were deployed. Unfortunately, 
they are not widely available yet. We needed these 
tools a year ago. 


10 Limitations and further work 


Although these routines are fairly straightforward, 
some users have prevailed with some minor enhance- 
ments. Cput does not have an option to run a 
program when the transfer is completed—a feature 
found in some file transport programs. In one ap- 
plication the receiving host must scan the receiving 
directory with a cron job looking for new files to pro- 
cess. These files are large (on the order of a gigabyte) 
and take a while to transfer. The cron job needs to 
know when the file is available. We set the file permissions
to 0000 until the transfer is complete. The
same application needed a unique file name on the
destination host. An optional string ("%u") in the
destination file name ensures a unique file name.
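
The permission trick can be sketched in a few lines. This is only an illustration in Python, not the cput code: the function name receive_file and its arguments are hypothetical, and a Unix-like system is assumed.

```python
import os


def receive_file(path, chunks, final_mode=0o644):
    """Write chunks to path, keeping its mode at 0000 until the data is complete."""
    # O_EXCL makes the call fail if the destination already exists.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o000)
    with os.fdopen(fd, "wb") as out:
        for chunk in chunks:
            out.write(chunk)
        out.flush()
        # Only now does the file become readable to an unprivileged poller.
        os.fchmod(out.fileno(), final_mode)
```

A cron job running as an ordinary user can then skip any file it cannot open and pick it up on a later pass.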

These routines make no effort to deal with a 
file that gets shorter during the transfer. The user 
should ensure that the source files don’t change dur- 
ing transfer. 

Stage makes no attempt to lock the target di- 
rectory. If two people stage to the same part of a 
target directory at the same time, the results are 
undefined. Stage can also overwrite programs while 
they are executing, causing core dumps in many ver- 
sions of Unix. Since it doesn’t handle special files or 
links, it is probably unsuitable for updating a remote 
root directory. 

One could teach stage about hard and symbolic 
links, but it would add a lot of complexity to the 
program, which doesn’t seem to be worth it. 

There is no mechanism here for a client user to 
determine how much disk space is available on his external
partition. The easiest solution is to install the
master directory on a partition of the same size as 
the slave’s partition. The user can monitor his inside 
usage. Stage makes no special effort to delete exter- 
nal files before installing new ones, so the outside 
partition could conceivably fill up during an update 
if it was nearly full. 

Some users wanted more control over the update 
process. Stage’s scan can take a long time, partic- 
ularly if the directory tree has many gigabytes and 
checksumming is used. These users wrote scripts 
to create a list of files to update, and xargs can
feed this list to stage. I had to add a parameter 
to suppress the descent below directories that were 
not mentioned on the command line, so we wouldn’t 
do more work than these scripts wanted. 

Cryptography can be no better than the qual- 
ity of the keys. It is hard to generate key mate- 
rial with general purpose computers. I rely entirely 
on truerand to get this right. I did generate 10 
megabytes of random data from truerand (it took 
several days) and had Eric Grosse, one of our local 
numerical analysts, run it through a suite of ran- 
domness tests. It passed. 

There have been some problems reported with
a slightly-restricted version of MD5 lately. Perhaps
MD5 will fall soon. It is possible that HMAC
using MD5 would still be safe: HMAC frustrates
some attacks on its hash primitive. In any case, I
will switch to SHA1.
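
As an illustration of how little changes when the hash under HMAC is swapped, here is a small Python sketch. It is not the stage code; the helper names make_tag and verify_tag are hypothetical, and the message shown in the usage note is only a placeholder.

```python
import hashlib
import hmac


def make_tag(key, message, digest=hashlib.sha1):
    """Return the hex HMAC of message under key, using the given hash."""
    return hmac.new(key, message, digest).hexdigest()


def verify_tag(key, message, tag, digest=hashlib.sha1):
    """Check a tag in constant time; True only if it matches."""
    return hmac.compare_digest(make_tag(key, message, digest), tag)
```

Calling make_tag(key, b"pu fn") gives a 40-hex-digit tag with SHA1; passing digest=hashlib.md5 gives the 32-hex-digit MD5 variant with no other change to the caller.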

In general, we like these routines the way they 
are, and are resistant to creeping featurism. We like 
their simplicity, and their interaction with standard 
Unix tools.


10.1 The Joint Ventures Problem 


Joint ventures often occur between two companies 
that don’t otherwise trust each other. Many such 
joint ventures only need to share a directory tree. 
This can reside on a neutral host somewhere. The 
contents of the directory tree, or the existence of the
venture itself, may be highly proprietary.

The stage command offers most of the function- 
ality needed to implement such an arrangement. Un- 
stage reverses the file transfer: the remote directory 
is the master and the local directory is the slave. 
Users can share their work with these routines.

These two tools lack only a locking mechanism, 
which would reserve a subdirectory or file. For ex- 
ample, assume that two authors work for separate 
companies, but need shared access to the source for 
their book. One could lock a chapter that he is work- 
ing on, and the other would have only read access 
to that chapter. The chapter could be staged back 
and the lock released. 

I’ve tried to come up with some simple mecha- 
nism to enforce locking using file system permissions 
in the master directory, without a satisfactory result. 
It would be nice to change the owner of a locked di- 
rectory, but that requires more privilege than I am 
willing to give the server software. 


11 Availability 


The early DES versions of these routines are freely 
available to AT&T and Lucent employees, and may 
be found on the companies’ Intranets [1]. I will not
attempt to distribute these. 

Marcus has published his original get and put 
routines, with the original crypto API but not the 
DES routines [20]. 

I expect to have publication clearance for the 
authentication-only versions for non-commercial use 
in time for this conference. I am keen to release these 
routines to the general public. A general release will 
expose them to public review and possible improve- 
ment. Good cryptography and secure programming 
are hard to do— it is in our corporate interest to 
run these routines through the wringer. 

See [19] for obtaining this software. 


Appendix - the Stage Protocol


This is the little protocol that stage and unstage 
use to control staged and the remote directory. The 
commands and responses are ASCII fields separated 
by a single blank and terminated with a zero byte. 


send rm fn
rcv OK


Remove the given file or directory. Everything
beneath the directory is removed as well. Returns
either “OK”, “ENOENT” (not found), or a string
describing some other error.


send st fn
rcv uid gid mode mtime size


Return the stat of a file or directory. The
mode is octal, the other values are decimal.
“ENOENT” is returned if the file
doesn’t exist, and other strings contain
a displayable error message.


send cs fn
rcv md5 checksum


Return the 32-hex digit MD5 checksum,
or an empty string if the file doesn’t exist.


send pu fn
send user group mode mtime size
send (size bytes)
rcv OK


Push a new file fn. It must not already
exist. User and group are alphabetic,
and currently ignored. Mode is octal,
and mtime and size are decimal. The
modification and access times are set to
mtime, if allowed. Returns “OK” or a
printable error message.


send md fn
send user group mode mtime size
rcv OK


Create a directory with the given mode
and mtime (if possible). User, group,
and size are ignored. Returns “OK” or
a printable error message.


send ls fn
rcv /fn1/fn2/.../fnn//


Return a list of files in the given directory,
separated by slashes and terminated
with a double slash. If fn isn’t a
directory, doesn’t exist, or is empty, “//”
is returned.


send ge fn
rcv OK
rcv size bytes
rcv (size bytes)


Get a remote file fn. Returns “OK” or
a printable error message. If OK, return
the size of the file in bytes, and the contents
of the file.


send ex
rcv OK


Exit.
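

The framing described at the start of this appendix (ASCII fields separated by a single blank, terminated with a zero byte) can be sketched as follows. This is a reading of the description above, not the staged source: all function names are hypothetical, the exact layout of the ls reply is an assumption, and file names containing blanks are not handled.

```python
def encode_message(*fields):
    """Join ASCII fields with single blanks and terminate with a zero byte."""
    return " ".join(fields).encode("ascii") + b"\0"


def decode_message(buffer):
    """Split one zero-terminated message off the front of buffer.

    Returns (fields, remaining_bytes).
    """
    end = buffer.index(b"\0")
    return buffer[:end].decode("ascii").split(" "), buffer[end + 1:]


def parse_stat(fields):
    """Interpret an 'st' reply; None means ENOENT, other text is an error."""
    if fields == ["ENOENT"]:
        return None
    try:
        uid, gid, mode, mtime, size = fields
        return {"uid": int(uid), "gid": int(gid), "mode": int(mode, 8),
                "mtime": int(mtime), "size": int(size)}
    except ValueError:
        # Anything else is a displayable error message from the server.
        raise RuntimeError(" ".join(fields))


def parse_listing(text):
    """Turn an 'ls' reply into a list of names; '//' yields an empty list."""
    return [name for name in text.split("/") if name]
```

For example, encode_message("st", "/tmp/x") produces the bytes a client would send to ask for the stat of /tmp/x, and parse_stat turns the decoded reply into numbers, reading the mode as octal.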


ABSTRACT


The two paradigms of searching and browsing are currently almost always used separately. 
One can either look at the library card catalog, or browse the shelves; one can either search 
large WWW sites (or the whole web), or browse page by page. In this paper we describe a 
software tool we developed, called WebGlimpse, that combines the two paradigms. It allows 
the search to be limited to a neighborhood of the current document. WebGlimpse 
automatically analyzes collections of web pages and computes those neighborhoods (at 
indexing time). With WebGlimpse users can browse at will, using the same pages; they can 
also jump from each page, through a search, to ‘‘close-by’’ pages related to their needs. In a
sense, our combined paradigm allows users to browse using hypertext links that are 
constructed on the fly through a neighborhood search. The design of WebGlimpse 
concentrated on four goals: fast search, efficient indexing (both in terms of time and space), 
flexible facilities for defining neighborhoods, and non-wasteful use of Internet resources. Our 
implementation was geared towards the World-Wide Web, but the general design is 
applicable to any large-scale information bases. We believe that the concept of combining 
browsing and searching is very powerful, and deserves much more attention. Further 
information about WebGlimpse, including the complete source code, documentation, demos,
and examples of use, can be found at http://glimpse.cs.arizona.edu/webglimpse/. 


1. Introduction


Browsing and searching are the two main
paradigms for finding information on line. The
search paradigm has a long history; search facilities
of different kinds are available in all computing
environments. The browsing paradigm is newer
and less ubiquitous, but it is gaining enormous (and
unexpected) popularity through the World-Wide
Web. Both paradigms have their limitations.
Search is sometimes hard for users who do not
know how to form a search query so that it is
limited to relevant information. Search is also
often seen by users as an intimidating ‘‘black box’’
whose content is hidden and whose actions are
mysterious. Browsing can make the content come
alive, and it is therefore more satisfying to users
who get positive reinforcement as they proceed.
However, browsing is time-consuming and users
tend to get disoriented and lose their train of
thought and their original goals.


These two paradigms are used separately by 
most systems. A site may give you a link to a 
search page that will allow you to search the whole 
site. But this search will not generally take into 
account where you are coming from, therefore it 
will not incorporate any knowledge you already 
gained while browsing the site. 


We argue that by combining browsing and 
searching users will be given a much more 
powerful tool to find their way. We envision a 
system where both paradigms will be offered all the 
time. You will be able to browse freely — the 
usual hypertext model — and you will also be able 
to search from any point. The search will cover 
only material related in some way to the current 
document. (Of course, global search may be
offered too.)


Suppose that you are looking for information 
on network research at a certain institution. You 
may first go to their home page. If you search for 
‘network’ throughout the institution, you'll get too 
many unrelated hits. If you search for ‘network 
research’ you may get too few hits, because the
word research may not appear in research
documents (some keywords are often missing 
because they are implied by context). On the other 
hand, if you could search only the pages of the 
research department, or only pages related to 
research, then a query for ‘network’ may be quite 
relevant. You will be getting the context for your 
query from the pages where you initiated it. Such a
search is much better than the currently most
common way to get that information, which is to
browse back and forth through dozens or hundreds
of pages. After all, the computer is the one that is
supposed to do most of the work, not you.


Some simple facilities for focusing the search
while browsing have been employed. We have 
attempted limited browsing and searching facilities 
in our GlimpseHTTP package [1]. Since the work 
on GlimpseHTTP has not been published (aside 
from the code), we describe here some of the 
features that WebGlimpse borrowed. 
GlimpseHTTP indexes a UNIX file system, and 
provides search that can be focused on any part of
the directory tree. For example, if you are looking 
for a Computer Science citation (our most popular 
demo [2] of GlimpseHTTP), you can browse one of 
16 different categories and perform the search only 
for the current category. Only one index is needed 
for the whole archive; the CGI program identifies 
the page from which the search originates, and 
limits the search accordingly. The search pages are 
created automatically by GlimpseHTTP. If the 
information is hierarchical, and it is organized in a 
corresponding hierarchical file system, 
GlimpseHTTP works well. But its browsing and 
searching capabilities do not apply to arbitrary links 
(or to sites with information that is not neatly 
organized in a tree structure). GlimpseHTTP has 
been very successful; it is used in more than 500 
sites [3], some quite major. WebGlimpse is a 
natural extension of GlimpseHTTP, borrowing 
some of its features but taking them much further. 


The idea of searching only parts of a
hierarchy is now used by Yahoo [4] to search only 
one category at a time. This is done in a very 
similar fashion to GlimpseHTTP, and it suffers 
from the same problem. For example, if you search 
for ‘‘mentally ill’’ under the ‘‘Arts’’ category, you 
will miss a site that features a ‘‘collection of art by 
mentally ill people’’. The reason for that miss is 
that this site is listed under ‘‘Arts Therapy’’ which
is listed under Arts, but only as a symbolic link (it 
is located in the ‘‘Mental Health’’ category). 
Yahoo probably made a good design decision not to 
follow symbolic links in the search, because 
following several such links may result in unrelated 
material. But following only one symbolic link 
from any subcategory would probably have been 
better. This is the kind of feature that WebGlimpse 
easily provides. 


Other web search servers take a different
approach. Both InfoSeek [5] and Lycos [6] offer
‘‘Find Related Sites.’’ The relationship is
computed by searching the whole database for
keywords similar to the major keywords of the
given document. This is a traditional information
retrieval approach, and it can be quite effective, but
it can also lead to unusual results. (For example,
we tried this feature in Lycos and looked for related
sites to ‘‘Accumulated Accordion Annotations’’ in
the ‘‘Entertainment & Leisure: Music: Instruments:
Individual Instruments’’ category. The second
match was ‘‘Comprehensive Epidemiologic Data
Resource WWW Home Page’’ in the ‘‘Health &
Medicine: Medical Research: Libraries, Databases
& Indices’’ category. The similarity was the word
‘‘accumulated.’’ The third match was in the
‘‘Space & Astronomy: Image Files & Archives’’
category and it apparently resulted from sharing the
word ‘‘annotations.’’) The idea behind
WebGlimpse is not only to utilize neighborhoods,
but to allow search with additional keywords on
these neighborhoods. In the ‘‘accordion’’ example
above, we may want to search pages related to
accordion annotations and also related to jazz.
With WebGlimpse we would have to input only
‘jazz.’ Such a query would probably have filtered
out the unrelated documents that somehow ended in
those neighborhoods.


Another example of focusing search to a
limited domain based on current browsing is the
WWW-Entrez system [7] from the National Center
for Biotechnology Information, which precomputes 
neighborhoods within the nucleotide database and 
within the Medline database and presents fixed 
links from all pages to their neighborhoods. The 
determination of sequence and text neighborhoods 
for the millions of records in the database is 
computationally intensive requiring weeks of CPU 
time. Like the Lycos and InfoSeek examples listed 
above, Entrez allows one to quickly explore a 
neighborhood from any given document, but it does 
not provide search within neighborhoods. We are 
told that they are considering adding this feature. 


The next section describes the design and 
implementation of WebGlimpse. Section 3 
presents applications of WebGlimpse, and Section 
4 ends with conclusions and further work. 


2. WebGlimpse Design and 
Implementation 


2.1. Overall Architecture 


In a nutshell, WebGlimpse works as follows. At 
indexing time, it analyzes a given WWW archive 
(e.g., a whole site, a collection of specific 
documents, or a private history cache), computes 
neighborhoods (according to user specifications), 
adds search boxes to selected pages, collects 
remote pages when relevant, and caches those 
pages locally. Once indexing is done, users who 
browse that site can search from any of the added 
search boxes and limit their search to the 
neighborhood of that page (or search the whole 
archive). In a sense, WebGlimpse transforms the 
archive into an easier-to-navigate hypertext. As its
name suggests, WebGlimpse uses our Glimpse
search engine, which was modified slightly to add a
few features useful for WebGlimpse. Only one
index is needed for each archive (e.g., for each
site). The focus to neighborhoods and the
collection of remote pages are done in an efficient
way at indexing time, so search is always fast.


WebGlimpse consists of several programs
which perform five main steps. The first four are
performed by the server (or publisher)
administrator to set up an archive, and the last one
is the actual user search of an existing archive.
Their main features are given below, followed by
more detailed descriptions.


Analysis of a given archive
Starting with a set of root URLs, this stage
traverses local and remote pages (see footnote 2) reachable
by a path of links of a given maximum
distance from the initial set. The links
contained in each page are extracted, and
their corresponding pages are then followed.
The end result of this stage is a full graph of
the whole archive, where the edges of the
graph are the HTML links. The limit on the
length of the traversed path can be set
differently for local and remote pages. For
example, one can allow unlimited distance on
the local pages, but only a distance of 2 at
any remote site.
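

The separate limits for local and remote pages can be illustrated with a small Python sketch over an in-memory link graph. This is not the WebGlimpse code: the function name traverse, the graph representation, and the way hops are counted (remote hops restart at zero whenever a local page is reached) are assumptions made for the example. It uses a breadth-first walk and visits each page once, on the first path that reaches it.

```python
from collections import deque


def traverse(links, roots, is_local, max_local=None, max_remote=2):
    """Return the set of pages reachable within the given hop limits.

    links maps a URL to the list of URLs it links to; a limit of None
    means unlimited distance.
    """
    seen = set(roots)
    queue = deque((url, 0, 0) for url in roots)
    while queue:
        url, local_hops, remote_hops = queue.popleft()
        for target in links.get(url, []):
            if target in seen:
                continue
            if is_local(target):
                new_local, new_remote = local_hops + 1, 0
                if max_local is not None and new_local > max_local:
                    continue
            else:
                new_local, new_remote = local_hops, remote_hops + 1
                if max_remote is not None and new_remote > max_remote:
                    continue
            seen.add(target)
            queue.append((target, new_local, new_remote))
    return seen
```

With max_local=None and max_remote=2 this matches the example in the text: local pages are followed without limit, while a chain of remote pages is cut after two hops.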


Collection of remote documents 
Non-local URLs are fetched and saved in a 
mirror file system. This is an optional step. 
Sometimes, local archives can be nicely 
complemented with data from remote 
sources. For example, with WebGlimpse one 
can collect in an archive a list of ‘‘favorite 
pages,’’ or simply one’s bookmarked pages. 
The links from these pages, and in general 
their structure, are preserved. This mirror file
system can serve as a ‘‘hypertext book’’
collected from the web.


Footnote 2: We will use the term page to denote an HTML document
corresponding to one URL.


Neighborhood computations
Depending on how neighborhoods are
defined, this step (which in practice is
interleaved with the first step) builds all the
neighborhood files to help with the search.
We will discuss neighborhoods in detail in
Section 2.5.


Addition of search boxes to selected documents
Selected documents with non-empty
neighborhoods are identified and modified by
the addition of HTML code which
provides the search facilities. It is possible to
define which pages will be selected in a
flexible manner.


Search 
Glimpse is used for all the search routines. 


We'll start with a description of Glimpse, because 
it serves as the basis for the whole design. 


2.2. Glimpse 


The search engine for WebGlimpse is glimpse [8] 
(which also serves as the default search engine in 
the Harvest system [9]). For completeness, we'll 
mention here the features of glimpse that are 
relevant to the searching/browsing problem. 
Glimpse is indexing and search software,
designed slightly differently from most other indexing
systems. Glimpse’s index consists essentially of 
two parts, the word file and the pointers file. The 
word file is simply a list of all the words in all 
documents, each followed by an offset into the 
pointers file. The pointers file contains for each 
word a list of pointers into the original text where 
that word appears. A search typically consists of 
two stages (some searches can do with only one): 
First, the word file is searched and all relevant 
offsets into the pointers file are found. The relevant 
pointers (to the source text) in the pointers file are 
collected. The second stage is another search, 
using agrep [10], in the corresponding places in the
original text. This is similar in principle to the
usual inverted indexes, except that the word file,
being one relatively small file, can be searched
sequentially. This allows glimpse to support very
flexible queries, including approximate matching,
matching to parts of words, and regular
expressions. These flexible queries are
implemented by running agrep directly on the word 
file. The fact that the files are searched directly 
allows the user to decide on-the-fly how much of 
the match to see. Glimpse’s default is to show one 
line per match (as in grep), but it can also show one 
paragraph or any user-defined record. This gives 
context to every match. 
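

The two-stage search can be illustrated with a toy Python version. This is only a sketch of the idea, not glimpse itself: the word file is a dictionary, the pointers are plain file names, Python regular expressions stand in for agrep (so approximate matching is not shown), and the function names are hypothetical. Patterns are assumed to match within a single word.

```python
import re


def build_index(documents):
    """Map each word to the set of files containing it (file-level pointers)."""
    index = {}
    for name, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index.setdefault(word, set()).add(name)
    return index


def search(index, documents, pattern):
    """Return (file, line number, line) for every line matching pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    # Stage 1: scan the word list sequentially to collect candidate files.
    candidates = set()
    for word, files in index.items():
        if regex.search(word):
            candidates |= files
    # Stage 2: search the candidate files directly, one line per match.
    hits = []
    for name in sorted(candidates):
        for lineno, line in enumerate(documents[name].splitlines(), 1):
            if regex.search(line):
                hits.append((name, lineno, line))
    return hits
```

Because stage 1 runs the pattern over every word rather than looking up an exact key, a query such as "netw" also finds files containing "network" or "networks".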


The second advantage of this design is that 
the pointers file can be built in many different 
ways. In particular, the granularity of the pointers 
— the precision of where they point to — can be 
set arbitrarily. The pointers can be to the exact
locations of the words, which is similar to regular
inverted indexes, to the documents (files)
containing them, to whole directories, etc. The
larger the granularity the more work will need to be
done in the second stage (where the source is
searched directly), but the smaller the index will be.
Glimpse supports three types of indexes: a tiny one 
(2-3% of the size of all files), a small one (7-9%), 
and a medium one (20-30%). In the medium-size 
index the pointers point to exact locations, in the 
small one the pointers are just file names, and in the 
tiny one the pointers are to blocks of files. 


The pointers file has another feature. Its
pointers are indirect. They are indexes to yet
another file, called the filenames file, which
contains the list of all indexed files. To support
WebGlimpse better, we added an option to glimpse
to store with each HTML file name its title and, if
relevant, its URL. WebGlimpse obtains this
information directly from the result of a glimpse
search. A typical search first gets offsets into the
pointers file, from there it gets the indexes to the
filenames file, from there it collects the file names
(and possibly the titles and URLs), and then in the
final stage it searches those files directly. The
performance of glimpse depends on the complexity
of the queries and the size of the index. Simple
queries with words as patterns and Boolean
operators are optimized by using hashing into the
word file. Such queries generally take less than a
second. More complex searches obviously take
more time, but it is worth the extra flexibility.
GlimpseHTTP was compared with WAIS [11]
(independently of us), with the conclusion that
“system requirements, administration and
maintenance and, most importantly, user
functionality is dramatically better than WAIS for
both searching and browsing.”


Glimpse is particularly suitable for our 
purposes because it supports very flexible ways to 
limit the search to only parts of the information 
base. Since the search is divided cleanly into two 
stages, searching for the words, and then going to 
the appropriate files, we can filter files in the second 
stage. This is done through two options in the 
search: -f and -F (the former added specifically for
WebGlimpse). The first one reads a list of file 
names from a given file and uses only those files. 
The second one uses the full power of agrep to 
filter file names by matching. We'll describe later 
how WebGlimpse uses these options to limit the 
search. 


2.3. Scanning the Archive 


A WebGlimpse archive is built with a script called 
confarc. The script first sets up the right paths to 
the archive, the URL of the cgi-bin programs, the 
title of the archive, and other administrative details. 
More important parameters to confarc are the initial 
URLs from which the traversal of the archive will 
begin, whether or not non-local URLs should be 
collected, and the definition of neighborhoods. The 
traversal of the archive is done in a straightforward 
way. Each HTML document is analyzed to extract 
all its outgoing links. All links are checked for 
locality (using IP numbers), and traversed using 
depth-first search (we may change that to breadth-first
search in later versions). WebGlimpse
provides flexible facilities, through the use of
regular expressions, to define which URLs/files 
should be included or excluded from the archive. 
For example, one may decide to exclude from the 
archive any files in a directory called /private/, 
except for one file there called ‘‘told-ya.html’’. 
The syntax for exclude/include follows the one 
used by the Harvest system [9]. 
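

The /private/ example can be expressed with an ordered list of regular-expression rules. The Python sketch below only illustrates the idea; it does not reproduce the Harvest syntax, and the rule format, the first-match-wins policy, and the name is_included are assumptions.

```python
import re

# Hypothetical rule list: the first matching rule decides.
RULES = [
    ("include", r"/private/told-ya\.html$"),
    ("exclude", r"/private/"),
]


def is_included(url, rules=RULES, default=True):
    """Return True if url should be part of the archive."""
    for action, pattern in rules:
        if re.search(pattern, url):
            return action == "include"
    return default
```

Placing the more specific include rule first is what lets one file escape an otherwise excluded directory.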


2.4. Collecting Remote Pages 


WebGlimpse provides facilities to collect remote 
URLs and include them in the indexing and in the 
search. A neighborhood can therefore include 
remote URLs, and users may jump through a 
neighborhood search to remote sites. We believe 
that such a facility is very important, and is sorely 
missing from local search facilities. Including 
remote pages in the search can help users make 
connections that would have been much harder
otherwise. If you are looking for ‘network 
research’ you may find that someone is working on, 
say, stochastic analysis and has a remote link to 
references about network research. Or if you are 
looking for military accidents involving computers, 
you may find a reference to a helicopter incident 
with remote links to discussions on the computers 
aboard. This feature makes the search a global
hypertext search.


The remote URLs that WebGlimpse collects 
are saved in separate files with a mapping 
mechanism from URLs to file names so the 
indexing and search can be done transparently. The 
original content of these URLs is not discarded. If 
space is a problem, however, then these mirror files 
can be removed, and the search will still be 
possible, because they have been incorporated into 
the (small) index. Glimpse has been modified to 
take this into account and provide a limited search 
(e.g., without showing the matching lines, which 
are not there anymore) when the original content is 
not available; it will still show the original URL 
(which is actually the typical way to show results
on the web). We originally did not allow a
recursive collection of remote URLs, so that
WebGlimpse would not be used as a robot.
However, based on user feedback, and the fact that
a multitude of robots already exist and are very
easy to deploy, we now allow arbitrary remote
collection. WebGlimpse abides by the robot
exclusion rules.


2.5. Neighborhoods 


When WebGlimpse traverses an archive, it also
computes neighborhoods. The current version
supports two types of neighborhoods. The first 
consists of all pages within a certain distance (the 
default is 2) of the current page. The idea is that 
WWW pages are often written with links to related 
pages, so the neighborhood concept is implicit. 
The second type consists of all subdirectories 
(recursively), similarly to the way GlimpseHTTP 
operates. The implementation of neighborhood 
search is quite simple. The list of all file names 
(recall that even remote URLs are mapped to local 
file names) that constitute a neighborhood of a 
given page is kept in one file (whose name is
mapped to that page).
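

The first type of neighborhood can be sketched as a bounded breadth-first walk over the link graph. This Python fragment is an illustration, not the WebGlimpse implementation: it assumes that distance is measured along outgoing links, that a page belongs to its own neighborhood, and that the graph is a dictionary; the name neighborhood is hypothetical.

```python
from collections import deque


def neighborhood(links, page, max_distance=2):
    """Return the pages within max_distance links of page, nearest first."""
    distance = {page: 0}
    queue = deque([page])
    while queue:
        current = queue.popleft()
        if distance[current] == max_distance:
            continue  # do not expand pages already at the limit
        for target in links.get(current, []):
            if target not in distance:
                distance[target] = distance[current] + 1
                queue.append(target)
    return sorted(distance, key=lambda p: (distance[p], p))
```

Writing the returned list to a file, one name per line, gives a neighborhood file of the kind described above.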


When a search is triggered, glimpse first 
consults the main index, finds the list of files with 
relevant matches, then intersects that list with the 
neighborhood of the given page. The neighborhood 
list is fetched at query time, and the index does not 
depend on it in any way, which allows easy access 
and easy modification if needed. For example, if a 
certain file or directory becomes irrelevant for some 
reason, all one has to do is delete its name from all 
neighborhood files (or from some). 
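

The query-time intersection is simple enough to show directly. In this Python sketch the neighborhood file is assumed to hold one file name per line, and the function names are hypothetical; glimpse performs the equivalent filtering internally.

```python
def load_neighborhood(path):
    """Read a neighborhood file: one file name per line, blank lines ignored."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def neighborhood_search(matching_files, neighborhood_path):
    """Keep only the index matches that lie in the page's neighborhood."""
    allowed = load_neighborhood(neighborhood_path)
    return [name for name in matching_files if name in allowed]
```

Because the list is read at query time, deleting a name from the file takes effect on the next search without touching the index.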


In the current working version of
WebGlimpse we added a compression of the 
neighborhood files to save additional space. This 
compression, which is especially designed to work 
with the glimpse index, is computed at indexing 
time through an extra program. The compressed 
neighborhood files are used directly with the 
glimpse search, so their decompression on the fly 
during the search is very quick. WebGlimpse
includes the compression and decompression 
programs, so if one wants to change any of the 
neighborhoods or generate them through another 
program, it can be easily done. We expect people 
to write scripts for generating neighborhoods, and 
this design makes such scripts completely 
independent of WebGlimpse itself. They can be 
run as post-processing steps. The neighborhood 
lists can further be compressed (and decompressed
on the fly) to save space if that becomes a concern.


We are working on other types of
neighborhoods as well as on general tools to allow
people more flexibility to define their own types. 
One idea is to allow several neighborhoods for one 
page so that users can decide which one to use at 
query time. There will still be one neighborhood 
file per page, plus sets of ranges into it (e.g., 
smallest neighborhood contains files 1-23, second 
one contains files 1-39). Using classification or 
clustering tools, one will also be able to define a 
neighborhood as ‘‘all pages in this site that are in
the same area.’’


2.6. Search Boxes 


To further integrate browsing and searching,
WebGlimpse automatically adds small search
‘‘boxes’’ to selected pages. An example of such a
search box is shown in Figure 1. The HTML code 
for including the box is given in a template file; it 
can be easily changed to fit the preferences of the 
archive maintainer. The boxes are added at 
indexing time by adding HTML code (essentially a 
FORM field) to the original pages. The boxes are 
added at the bottom of pages, and special markers 
(implemented as HTML comments) are also added 
so that the boxes can be easily removed 
(WebGlimpse includes a program to do that). It is 
possible to customize where the boxes are added by 
simply moving these markers. The default is to add 
the boxes to all HTML pages with non-empty 
neighborhoods. The same kind of exclude/include 
facilities are available for selecting these pages.
(Obviously, if a page is excluded from indexing no
search box is added to it, but the opposite may not
be true in some cases.) The typical approach on the 
web is to provide a link to a separate search page. 
We decided to add boxes to all pages so that users 
do not need to go anywhere else for search. This 
minimizes the ‘‘context switching’’ and keeps users 
focused. They can see the content of the page 
while they compose the query. 
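

The marker mechanism can be sketched as follows. The marker strings, the function names, and the choice to insert just before the closing body tag are assumptions made for this Python illustration; the actual WebGlimpse template and markers may differ.

```python
import re

# Hypothetical marker comments that delimit the inserted box.
BEGIN = "<!-- SEARCH-BOX-BEGIN -->"
END = "<!-- SEARCH-BOX-END -->"


def remove_box(html):
    """Strip a previously added box, leaving the rest of the page untouched."""
    start = html.find(BEGIN)
    end = html.find(END)
    if start == -1 or end == -1 or end < start:
        return html
    end += len(END)
    if html[end:end + 1] == "\n":
        end += 1
    return html[:start] + html[end:]


def add_box(html, box_html):
    """Insert the box between markers at the bottom of the page."""
    html = remove_box(html)  # keeps repeated indexing runs idempotent
    block = f"{BEGIN}\n{box_html}\n{END}\n"
    closers = list(re.finditer(r"</body\s*>", html, re.IGNORECASE))
    if not closers:
        return html + "\n" + block
    pos = closers[-1].start()
    return html[:pos] + block + html[pos:]
```

Since the markers are ordinary HTML comments, browsers ignore them, and a later run can find and replace the box without parsing the page.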


Each box also includes a dynamic link to 
advanced search options. It’s dynamic in the sense 
that it is generated on the fly and can also display 
the neighborhood (also generated on the fly). The 
search options interface is shown in Figure 2. The 
advanced options include old glimpse options like 
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‘‘case sensitivity,’ ‘‘partial match,’ and ‘‘number 
of spelling errors,’’ as well an option to jump 
directly to the line of the match (more on that in the 
next section). A very nice new option allows users 
to search only files updated within the last X days. 
The downside of adding boxes is that pages are
modified, to which some users may object (especially
in a shared environment), and if pages are printed
there is a little extra to print. We believe that in
most cases this is still worthwhile. (Of course, it is 
easy to modify the box to make it just one link to 
the advanced search.) 


2.7. The Output 


The output of a query is a set of records, one for 
each matching file. WebGlimpse formats the
results in four ways:

Context for each match


WebGlimpse outputs the title of each matching 
URL with a link to it (if the file name corresponds 
to an HTML document), and, since glimpse 
provides the matching lines or records and not only 
the file names, all the matching lines or records are 
output too. In general, the records provide quick- 
and-dirty context for matches. 

Figure 2: WebGlimpse’s search options

Figure 3: An example of WebGlimpse’s output (searching for ‘‘privilege and license’’)

Providing line numbers


Glimpse can compute the right line number for 
each match, and WebGlimpse has an option 
(borrowed again from GlimpseHTTP) to bring the 
documents automatically to that line number. This 
is done by modifying the HTML document on the 
fly to insert the corresponding anchor, which is not
trivial because some links may need to be
recomputed as well. 

Highlighting key words 

All the matched keywords are highlighted
(by formatting them in bold in HTML), both in the
output records and, when line numbers are used,
in the files themselves if the links are followed.


Showing dates of modification 


Starting at version 3.5, glimpse can provide dates 
for each file, filter by dates, and show them. 
WebGlimpse uses these features. 


An example of an output (one match of a search for
‘‘privilege AND license’’ from our demo of the
Arizona legislature pages) is given in Figure 3. 
(The date refers to our copy, not the original 
document.) 


2.8. Experiments and Experience 


Our experience with WebGlimpse is still quite 
limited. We give here some conservative numbers 
that we obtained, using a very complex archive. 
All experiments were run on a DEC Alpha 233 
MHz with 64MB of RAM. 


The archive for this experiment was the
pages of the Arizona Legislative Information
System, which includes information about budget,
constitution, committees, statutes, the floor
calendars, agendas, and (the most space
consuming) full text of bills. This archive occupies
152.5 MB and contains about 20,000 files. The number of
links was quite high, and we selected the
neighborhoods to be within 3 hops of the given
page. (In hindsight, this is too much. The
neighborhood files became too big and search was
not sufficiently restricted; the indexing stage was
also too slow as a result. But it gives a worst-case
scenario.) We selected this archive for its
complexity and the multitude of links.
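
A neighborhood of "all pages within k hops" can be computed with a breadth-first traversal of the link graph. The sketch below is a minimal illustration, not the Perl code used by WebGlimpse; the function name, the dictionary representation of links, and the choice to include the starting page in its own neighborhood are assumptions.

```python
from collections import deque


def neighborhood(links: dict[str, list[str]], start: str, max_hops: int) -> set[str]:
    """Return all pages reachable from `start` by following at most `max_hops` links."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand pages that already sit at the hop limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    return seen
```

With heavily linked pages the set grows quickly with each additional hop, which is consistent with the observation that 3 hops was too generous for this archive.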


The complete indexing process took 3 hours 
and 5 minutes. It took about an hour to traverse the 
whole archive and compute all the neighborhoods, 
an hour to analyze all HTML files and add the 
search boxes (we added search boxes to all files 
that had non-empty neighborhoods — about 85% 
of all files), 22 minutes to index the whole thing, 
and 41 minutes to compress all the neighborhood 
files. (We should mention that the glimpse
indexing and compression are done in C, and
everything else is done in Perl.) We believe that
this is indeed close to the worst case, because the 
neighborhood structure was so complex (some 
neighborhoods, especially ones from the floor 
calendar files, contained thousands of links). The 
archive after indexing occupied 205.8 MB, a 34% 
increase, which was divided about equally between 
the index, the (compressed) neighborhood files, and 
the added search boxes. 


Query times depend heavily on the type of
queries. We ran a few experiments with typical
queries. As expected, we found no significant
difference between a whole archive search and
neighborhood searches. The running times,
measured from the time a user clicks on the search
box to the time the results page appears in the
browser, range from 2 seconds for a query that
matches fewer than 10 hits, to 12 seconds for a
complex query with many hits. The corresponding
times for pure glimpse search are less than half
that; the rest of the time is taken to compose the
HTML result page.


These numbers are taken from an early 
version of WebGlimpse. We put most of our 
efforts into making WebGlimpse work in a flexible 
manner. We expect the performance numbers to 
improve as we tune the code. More information 
will be posted to the WebGlimpse home pages
(http://glimpse.cs.arizona.edu/webglimpse/). 


2.9. Design Decisions 


We briefly discuss here the rationale behind the
major design decisions we made. As we said in the
abstract, our four main goals were fast search,
efficient indexing, flexible facilities for defining 
neighborhoods, and non-wasteful use of Internet 
resources. 


Our first design decision was to have fixed 
neighborhoods, computed at indexing time. There 
has been a lot of discussion lately of search agents 
that will traverse the web looking for specific 
information. While the ability to explore in real 
time is very attractive, we believe that at the 
moment the web is too slow and the bandwidth is 
too narrow to support wide deployment of a large 
number of such wandering agents. WebGlimpse 
can be thought of as an agent for servers, but users 
still search fixed indexes residing in one place. 
This makes the search much faster, but it limits its 
flexibility. In particular, neighborhoods in 
WebGlimpse cannot be defined by the users. It is
possible, however, for the server maintainer to 
provide several different neighborhood definitions 
and to let the users choose between them. We have 
not yet implemented such an option. 


The ability to define any neighborhood one
wants is important. In the design, we emphasized
decoupling the definition and use of neighborhoods
from the rest of the system as much as possible. The
neighborhood files are consulted only at the end of
the search process, and they are not integrated into
the index in any way.
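
One way to picture this separation is as a filter applied to the search results after the index has done its work. The following is a sketch under assumptions (the function name and the shape of the hit records are hypothetical), not the actual WebGlimpse code.

```python
def restrict_hits(hits, neighborhood):
    """Keep only hits whose file belongs to the chosen neighborhood.

    `hits` is a list of (filename, matching_line) pairs as a search engine
    might return them; `neighborhood` is an iterable of filenames read from
    a neighborhood file. The index itself is never consulted here.
    """
    allowed = set(neighborhood)
    return [(name, line) for name, line in hits if name in allowed]
```

Since the filter sees only file names, a different neighborhood definition changes the neighborhood files but leaves the index untouched.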


Fast search always conflicts with efficient 
indexing. The more you spend on indexing,
especially more space, the faster the search.
WebGlimpse was designed for small to medium-sized
archives, and, like glimpse, it puts indexing
efficiency slightly above search speed.
Nevertheless, the search is quite fast, and indexing
can sometimes take quite a long time. Indexing can
be slowed down by two features: 1) having to fetch 
many remote documents (nothing we can do about 
that), and 2) having to compute and manipulate 
complex large neighborhoods. 


User interface is very important to any search 
application. We believe that our design of the 
output of queries — with the inclusion of matched 
lines, highlighted keywords, and dates of 
modification — will be very helpful. Being able to 
quickly judge the relevance of the output to one’s 
query is a problem in many search services today. 
One often finds oneself spending considerable time 
following up on the multitude of links that are 
returned as results of queries. 


We believe that making the search box 
available all the time, rather than the more typical 
link to a different search page, is a good idea. A 
search box does not take much ‘‘real estate,’’ and it
allows users to compose their search query while 
looking at the page. We received comments from 
web administrators who do not want to modify 
existing pages to add the search boxes. This is a 
valid concern, especially in sites where pages are 
owned by many people. But many sites strive for 
coherent design and adding boxes is not much 
different than requiring a certain format or adding 
uniform links to the main search page. We provide 
easy tools to add and remove those search boxes at 
any time. 


3. Applications 


The main application of WebGlimpse is, of course, 
to provide search for collections of hypertext files. 
We foresee several other related applications. 


3.1. Building Personal or Topical 
Collections 


In a sense, WebGlimpse is a ‘‘light’’ version of
Harvest [9]. It does not have the full power of
Harvest to automatically collect massive
information from given sites and extract the
indexed text from different formats. But it does
allow one to collect and organize specific
documents that are relevant. In contrast with
Harvest, the link structure of that information is
kept. (Harvest, like most global Internet search
facilities, moves everything it collects into a flat
structure.) Hypertext ‘‘books’’ can be written
much more easily with WebGlimpse, which will
collect and index all links (citations) and allow
flexible browsing through the whole book. Lists of
‘‘interesting sites’’ are commonly kept by many
people; some are quite substantial. WebGlimpse
will allow the maintainers of such lists to easily
make them searchable with full-text search.


3.2. History and Cache Files 


Most web browsers cache the pages they fetch.
Like typical caches, this is done completely
transparently to the user, and is done for performance
purposes only. But having this large set of mostly
relevant pages can be very helpful. For example, 
two years ago we designed a system called 
Warmlist [12] that cached pages per user demand, 
indexed them, organized them, and provided full 
text search. It was meant to be a natural extension 
of the ‘‘hot list’’ concept. There are now similar
commercial products, which also allow users to cache
everything automatically and to view and navigate
based on this cache. This becomes a history feature 
more than just a cache. With WebGlimpse, you 
can easily construct an archive of your history list. 
Not only can you browse and search it, but you 
may also discover relationships between pages — 
by viewing the neighborhoods — or other context 
information. 


3.3. Visualization and Customization


Combining WebGlimpse with graph drawing 
packages (such as [13] and [14]) will allow for 
better visualization of the hypertext structure. 
Imagine adding to the results of queries some
summaries of the documents, icons of them, or
other useful information of the kind you find in 
static pages with links to related documents. When 
you visit a page and perform a query (and queries 
with WebGlimpse can be simpler because the 
context is kept and the domain of the search is 
smaller) you get a customized view of the ‘‘way 
ahead’’. This view may be just as good in terms of 
information as the static view, but much more 
relevant to you. In other words, you as the
navigator can build your own hypertext part of your 
way, customized to the areas of interest to you. In 
particular, it would be very interesting to see how 
to combine our fixed neighborhood search with the 
ideas of scatter/gather [15], or an automatic 
classification system like AIM [16]. 


4. Conclusions 


Searching the web is growing quickly from an 
infancy stage. The current facilities are quite 
amazing and very useful, but they are far from 
being the last word or even a good word. Finding 
useful information on the web is still a very 
frustrating process. We believe that more methods 
need to be attempted, more prototypes employed 
and experimented with, and more paradigms need 
to be explored. WebGlimpse presents one such 
attempt. It is simple, easy to build, natural to 
follow, and flexible so it can be extended. 
WebGlimpse is part of the FUSE (Find and USE) 
project at the University of Arizona 
(http://www.cs.arizona.edu/fuse/), where we are
working on other methods for the same problems.

Abstract


Electronic mailing lists are a common forum for
discussion on the Internet. Each participant receives a
copy of each message posted to the list, with an
additional copy typically archived at a well-known site for
later retrieval. Busy lists and long-running lists can
quickly accumulate many messages, making it
difficult to locate information. Few sites provide
mechanisms for the efficient storage or retrieval of
information from these archives. The Mailing List
Archive (MLA) tools were designed to address these
issues. Mail messages are stored in a space-efficient
manner and the archive information is processed to
create a database that is optimized for fast queries.
The MLA tools are designed for use as Common
Gateway Interface (CGI) applications so that archives
can be accessed through HTML forms on the World
Wide Web. Concurrent update and retrieval are an
integral part of the design so that up-to-date
information is always returned in query results.


1.0 Introduction 


UNIX mailing list archives are typically unstructured
files that hold the concatenated mail messages stored
in the order in which they were received. This simple
format makes it easy to apply text-oriented UNIX
tools such as grep and awk but fails to capture much
of the structure of the discussions that transpire on a
list. Furthermore, when mailing lists are especially
active, archives can grow very large, causing
text-oriented tools to slow down noticeably. Bulletin board
systems such as Usenet together with threaded reader
programs attempt to address these problems but
require nontrivial administration and the use of
involved packages for transmitting and storing
messages. For many discussion lists electronic mail is the
preferred medium, and what is needed is a simple-to-use
back-end system for archiving messages. While it
is possible to feed a mailing list into the normal
Usenet software for the purposes of creating an
archive, the MLA tools were created as a much
simpler solution that is easier to maintain and more
efficient.

The MLA tools consist of a program that stores 
mail messages in a mailing list archive database and 
several programs for doing queries from a database. 
Additional tools are provided for doing administrative
tasks such as removing messages that are old (i.e.,
expired) and for optimally compressing a database
(most compression is, however, done on-line). User
access to an archive is possible through command-line
programs, electronic mail, or HTML forms on the
World Wide Web. Message relationships (i.e. threads 
of discussion) are maintained and query and 
navigation tools preserve this threading structure even 
when navigation is done with an HTML browser. 

The MLA tools differ from other packages in 
several important ways: 

• They work with relatively unstructured mail
messages (as opposed to news postings, which
have structured information that simplifies
cross-referencing messages). Programs such as
hypermail [1] also work directly with mail
messages but do not do as good a job of
deducing message relationships, do not work
on-line, and do not provide a query interface.

• They are intended to be simple to use in
unprivileged environments. The tools require
no special-purpose protocols or system
registration of services.

• They are designed to efficiently support
database updates without impacting concurrent
queries. The MLA tools are designed with the
express intent of doing on-line updates for each
mail message.

• They are simple to use within the framework of
the World Wide Web by way of a template-based
facility for generating HTML documents.

The MLA tools might best be contrasted with hybrid
systems built from an indexing engine and a CGI
interface to a threaded news server. A system such as
ni [4] has similar functionality but lacks the
navigational support provided by the MLA tools.

The next sections of this paper present the tools 
and provide examples of how they are used. 
Subsequent sections discuss the design of the 
database used to store archives, issues in parsing and 
building threaded mailing list archives, and issues in 
building a useful navigation interface for the World 
Wide Web. These are followed by a section reporting 
performance results and, finally, a summary and 
discussion of future work. 


1.1. Creating a Mailing List Archive 


The mlaupdate tool is used to create and update (add 
messages to) an archive. Typically, an MLA is created
from an existing archive of mail messages and then a
feed is set up so that new traffic is directly entered into
the database. Old message archives are assumed to
come in three forms: mbox-style files in which mail
messages are concatenated and the UNIX-style
“From” message lines are still present, MH folders
where each mail message is written to a separate file
and tagged with a “Delivery-Date” header line, and
Usenet news articles where a “Newsgroups” line is
treated as a “To” address. To convert an mbox-style
file to an MLA, a command of the form:


mlaupdate foo-archive 


might be used. This creates an MLA in the current 
working directory, if it does not already exist, and 
enters the mail messages found in the file foo-archive. 
mlaupdate can also read from standard input; for 
example, 


zcat foo-archive.Z | mlaupdate - 


can be used to supply a compressed archive to mlaupdate.

Alternatively, the following enters all the 
messages from an MH folder: 


mlaupdate -y ~/Mail/inbox/* 


(The -y option to mlaupdate forces it to not scan the
input text for an initial message separator pattern, typically
a UNIX-style “From line” as generated by the
normal system mail delivery program. In this case the
-y option is needed because MH removes UNIX-style
“From lines” when mail messages are incorporated
into folders.)

Messages are stored in a compacted form with
the useful header information written to a table of
contents database file that is optimized for searching
and the body written to a message-body database file.
Message bodies that exceed a certain threshold size 
are written to separate spillover files instead of the 
message body database to optimize space usage in the 
hashed message body database. All message bodies, 
whether they appear in the database file or in a 
spillover file, are stored in a compressed form using a 
PKZIP-style compression algorithm [8]. 
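
The storage rule for message bodies can be illustrated with a short sketch. It is not the MLA implementation: dictionaries stand in for the hashed database and the spillover files, the threshold value is invented, and comparing the threshold against the compressed size (rather than the raw size) is an assumption. Python's zlib module is used because it provides the same deflate-style compression.

```python
import zlib

SPILL_THRESHOLD = 4096  # hypothetical cutoff in bytes; the real value is not stated


def store_body(msg_no, body: bytes, db: dict, spill: dict,
               threshold: int = SPILL_THRESHOLD) -> None:
    """Compress a message body and file it in the database or a spillover slot."""
    packed = zlib.compress(body)
    if len(packed) > threshold:
        spill[msg_no] = packed  # stands in for a separate spillover file
        db[msg_no] = None       # the database entry only refers to the spillover
    else:
        db[msg_no] = packed


def load_body(msg_no, db: dict, spill: dict) -> bytes:
    """Return the original body, wherever the compressed form was stored."""
    packed = db[msg_no]
    if packed is None:
        packed = spill[msg_no]
    return zlib.decompress(packed)
```

Either way the body is compressed, so a reader of the archive does not need to know which of the two places a message ended up in.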

In addition to the message header information, 
the table of contents database also includes inter- 
message relationships such as whether a message is a 
reply to another message. This information is used 
to build thread relationships between messages: all 
messages that appear to be part of a single thread of 
conversation are identified and query results may 
provide messages organized as threads, as opposed to 
unrelated collections of messages. 


1.2. Feeding an Archive 


A key component of the MLA design is the ability to
directly connect a live feed to the database. That is,
mailing list traffic can be immediately entered into an
archive with the thread-related information automatically
updated and available for queries. The typical
way to effect this is to set up a mail alias that invokes
the mlaupdate program to enter each mail message in
an archive. Alternatively, mail messages can be fed to
mlaupdate through personal delivery tools such as the
slocal [3] or procmail [5] programs. Each linkage
mechanism has advantages and disadvantages; most
importantly, delivery via a mail alias may require that
the MLA files and containing directory be owned by
the user that invokes mlaupdate. Invoking mlaupdate
through personal delivery tools permits file protections
to be set up in a more private manner. The ownership
restriction of the sendmail delivery route can be
overcome by using a setuid wrapper program that
invokes mlaupdate. Note also that ownerships and
protections must permit read access by the CGI
applications invoked by the HTTP server.

The following .maildelivery entry might be used 
with the MH slocal program to feed incoming 
messages for a “flexfax” mailing list directly into an 
archive: 


sender owner-flexfax | R \
"mlaupdate -y -d arch/fax -"


(the -y option is supplied in case slocal removes the
“From” line; see above). An equivalent mail alias
would be:


flexfax-archive: "|/bin/mlaupdate \
-d /fax/Archives -"


Figure 1. Form generated by mlaform.


Note that in this situation the -y option is not needed 
because the “From line” is known to be present in the 
message delivered to mlaupdate. 

If a mailing list has a very high traffic rate and 
archive updates cause excessive locking of the 
database, then the mlabatch program can be used in 
place of mlaupdate. mlabatch works by holding onto 
mail messages for a period of time before entering 
them in a database. If multiple messages are received 
before an update is scheduled, then they are batched 
together and only a single update of the database is 
done (thereby reducing the time the archive is kept 
locked). mlabatch supports all the same command line 
options that mlaupdate does as well as two additional 
options for controlling the batching operation. Thus 
an archive feed can be converted to use batching 
simply by invoking mlabatch instead of mlaupdate. 
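
The batching idea can be sketched in a few lines. This is an illustration of the technique only; the class, its method names, and the way the hold period is measured (from the arrival of the first pending message) are assumptions, and mlabatch's two batching options are not modelled.

```python
class Batcher:
    """Hold incoming messages and hand them to one update call per batch."""

    def __init__(self, update, hold_seconds: float):
        self.update = update              # callable that takes a list of messages
        self.hold_seconds = hold_seconds  # how long to wait after the first arrival
        self.pending = []
        self.first_arrival = None

    def add(self, message, now: float) -> None:
        """Queue a message; the hold period starts with the first one queued."""
        if not self.pending:
            self.first_arrival = now
        self.pending.append(message)

    def tick(self, now: float) -> None:
        """Flush the queue in a single update once the hold period has elapsed."""
        if self.pending and now - self.first_arrival >= self.hold_seconds:
            batch, self.pending = self.pending, []
            self.update(batch)  # one database lock covers the whole batch
```

The time is passed in explicitly so that the behavior is easy to exercise; a real tool would read the clock itself.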


1.3 Querying an Archive Through the 
World Wide Web 


Three tools are provided for querying an archive
through the World Wide Web: mlaform, mlaquery,
and mlafetch. These programs comply with the Common
Gateway Interface (CGI) protocol [12] and so
may be invoked directly from other HTML pages.
mlaform is used to generate an HTML page from
which a query is constructed. mlaquery implements
database queries, returning a specially formulated
HTML page that includes a hidden form of the query
result in each HTML link. mlafetch retrieves a mail
message from an archive, also encoding query results
in links for navigational purposes. In normal use, a
user is presented a query form generated by mlaform
(see Figure 1), constructs a query that is performed by
mlaquery (Figure 2), and then retrieves articles using
mlafetch (Figure 3). As such, the design of these three
programs assumes that they will be used together.

Figure 2. Query result generated by mlaquery.

mlaform is typically used as the top-level
interface to an archive. It generates an HTML page
based on a template file and the underlying MLA
database. Template files are HTML documents that
contain escape codes that are replaced by an MLA
tool. Interesting information such as the number of
messages in the archive and the ages of messages in
the archive is available through escape codes.
Escape codes are also provided for formulating
queries that are to be done by mlaquery. The fragment
of the flexfax archive template shown in Figure 4
gives the general flavor of an MLA form template.

Figure 3. Article retrieved by mlafetch.

mlaquery supports HTML query-style forms; it
takes its arguments from the command line or through
shell environment variables as specified by the CGI
specification. Searches are done over a single archive
with queries constructed from phrases that constrain
the returned messages:

• match a regular expression or boolean
combination of expressions against mail
message header lines (e.g., messages from
“sam@.*sgi.com”),
• consider messages within a specific time frame,
and
• ignore upper-lower case text distinction in
matching strings.

The matching messages are returned collated according
to thread, author, subject, or date. The maximum
number of hits returned can be controlled as well as
the maximum depth of the threading information
returned when query results are collated by threads.
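
The query constraints listed above map naturally onto a small filter function. The sketch below is only an illustration: the record type, field names, and parameters are hypothetical, boolean combinations of expressions are left out, and collation by thread is omitted.

```python
import re
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Msg:
    number: int
    sender: str
    subject: str
    date: datetime


def query(msgs, pattern, field="sender", since=None, until=None,
          ignore_case=False, collate="date", max_hits=None):
    """Return messages whose header field matches, within an optional time frame."""
    rx = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    hits = [m for m in msgs
            if rx.search(getattr(m, field))
            and (since is None or m.date >= since)
            and (until is None or m.date <= until)]
    hits.sort(key=lambda m: getattr(m, collate))  # collate by date, sender, or subject
    return hits if max_hits is None else hits[:max_hits]
```

For example, `query(msgs, r"sam@.*sgi.com", ignore_case=True, max_hits=10)` would return at most ten messages from matching senders, oldest first.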

The HTML document generated by mlaquery
includes an encoded form of the query hit set (the set
of messages that satisfied the query). This information
is critical to the navigational support provided by the
tools and is described more in Section 2.3.
mlafetch returns a formatted message together
with links to other messages in the query hit set and to
other messages in the database that are related to the
message; e.g., if a message is a reply to another
message then the original message can be viewed
even if it was not matched by the query. mlaform
propagates the encoded information passed to it by
mlaquery so that the context of the original query is
always present. As with mlaform and mlaquery, the
format and content of the HTML are defined by a
template file that administrators can tailor to their
needs.


Figure 4. Sample MLA form template. 


<PRE>
<B> From:</B>%+F
<B> Subject:</B>%+S
<B>Starting Date:</B>%+< <I>until</I>
<B> Ending Date:</B>%+>
<B> Sort Hits By:</B>%+C <INPUT TYPE=submit
VALUE="Search Archives"> <INPUT TYPE=reset
VALUE="Reset Query">
</PRE>


1.4 Administering an Archive


Archives are for the most part self-maintaining. Data
structures are designed to grow efficiently as new
messages are added to the database. There are, however,
some components of the database that can be
compacted if the database is manipulated off-line, and
tools are provided to do this. Otherwise, there are
tools to delete messages from an archive and low-level
tools for interrogating the internal data structures
used in an archive, mainly for the purpose of
debugging problems. Each MLA is stored in a separate
directory and may be copied with the normal
UNIX command line tools. However, using programs
such as cp or tar to copy an MLA may generate a copy
that uses more disk space than the original because
the hashed database package used by the MLA tools
utilizes holey (sparse) files. If this is an issue, the mladump
tool can be used to emit an ASCII form of the archive
that can then be fed into mlaupdate. Thus an
archive can be copied by commands of the form:


mladump -d foo.orig first-last | \
mlaupdate -d foo.copy -


2.0 MLA Design and 
Implementation 


The MLA tools were designed to be efficient both
for queries and for updates, but the expectation was
that far more queries would be done than updates. It
was very important, however, that on-line updates be
supported so that queries always return current
information; otherwise, the tools would not be used in
place of existing facilities.

An archive consists of at least two files: the table
of contents database and the message body database.
Message bodies that are too large to fit directly in the
message body database file are written to spillover
files, with the entry for the message set up to reference
this file. The table of contents file is the critical data
structure in obtaining high performance for both
queries and updates; it is managed with a special-purpose
package. The message body database is
implemented using a publicly available hashed
database package [9]. Message bodies are compressed
with a PKZIP-style compression algorithm whether
they are stored directly in the database file or in
spillover files; this is done with the publicly available
zlib package [10].


2.1 Table of Contents Database 


The table of contents database (TOC) holds all 
the information about an MLA except for the mail 
message bodies. The data structures are designed so 
that programs that do queries can map the file
contents into memory in a read-only fashion and do a 
lookup with minimal overhead. Specifically, the 
number of pages that must be touched is typically 
small and no parallel data structures or excessive 
calculations are required of the application as a result 
of the memory-mapped style of accessing the data. 
TOC updates also use a memory-mapped interface to 
the data and are designed to be done in-place for the 
vast majority of updates (when certain data structures 
overflow the database is grown to accommodate more 
information and this operation can be expensive to 
carry out). The efficiency of normal queries and 
updates is important in minimizing TOC lock 
contention when queries collide with updates. 
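
The read-only, memory-mapped style of access can be sketched as follows. The header layout used here (a 4-byte magic string, a message count, and a string pool offset) is purely hypothetical; the real TOC header format is not given in this section.

```python
import mmap
import struct

# Hypothetical fixed-size header: 4-byte magic, message count, string pool offset.
HEADER = struct.Struct("<4sII")


def read_header(path: str):
    """Map the file read-only and decode the header directly from the mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            # Only the pages actually touched by this read are brought into memory.
            magic, n_messages, pool_offset = HEADER.unpack_from(view, 0)
            return magic, n_messages, pool_offset
```

No separate in-memory copy of the tables is built; a lookup decodes fields in place at known offsets.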

Each TOC has a fixed size header followed by 
several tables and a string pool. The string pool is a 
data structure that holds all the ASCII strings retained 
from the headers of mail messages stored in the 
archive. Redundant strings and substrings are shared 
so that, for example, subject lines are trivially 
compressed. The other tables in the TOC hold the 
following information: 

• a message descriptor table that has one entry for
each message in the archive,
• a thread table that identifies the top-most
message in each thread of discussion,
• a sorted message table that references entries in
the message descriptor table sorted by message
delivery time, and
• a reply spillover table that holds message reply
information that does not fit in the normal
message descriptor data structure.

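
The sharing of redundant strings in the string pool can be illustrated with a minimal sketch. It assumes NUL-terminated strings referenced by byte offset, which lets a string that is a suffix of an earlier one reuse the same bytes; the actual MLA pool format and its sharing strategy may differ, and the class and method names are invented.

```python
class StringPool:
    """NUL-terminated ASCII strings stored in one buffer, referenced by offset."""

    def __init__(self):
        self.data = bytearray()

    def intern(self, text: str) -> int:
        """Return the offset of `text`, reusing existing bytes when possible."""
        raw = text.encode("ascii") + b"\0"
        at = self.data.find(raw)  # also matches a suffix of a longer stored string
        if at >= 0:
            return at
        at = len(self.data)
        self.data += raw
        return at

    def get(self, offset: int) -> str:
        """Read the string that starts at `offset` and runs to the next NUL."""
        end = self.data.index(0, offset)
        return self.data[offset:end].decode("ascii")
```

Under this scheme a subject such as "fax modem setup" costs no extra space once "Re: fax modem setup" has been stored, which is one way subject lines could be "trivially compressed".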
Tables are initially sized according to a set of 
heuristics and configurable parameters; these 
parameters are stored in the TOC. From that point 
onward the tables are automatically grown whenever 
entering a new message would fill up a table. The 
revised table sizes are calculated to reflect the existing 
usage patterns, with each table guaranteed to be no 
more than 75% full. The initial table sizes are based 
on the following assumptions: 

• the number of threads of discussion is at most
1/3rd the number of messages, i.e., on average
there will be no more than two replies for any
message;
• the size of the reply spillover table is 1/6th the
number of messages, the expectation being that
most messages get a single reply that can be
stored directly in the message descriptor;
• the string pool gets an average size message
(administrator defined) for 1/2 the maximum
number of messages (since most messages are
replies there is a significant sharing of strings
for the subject, user names, mail addresses,
etc.).

By default, mlaupdate creates new archives with 
space for 2500 mail messages, and assumes that an 
average message requires 128 bytes of space in the 
string pool. These numbers were chosen empirically. 
At an average of 100 mail messages per day, these initial
sizes accommodate a mailing list for 25 days
before the database must be expanded. After that, the
database will need to grow about once a week if traffic
patterns remain consistent.
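
The sizing assumptions and defaults above can be worked through numerically. The sketch below simply applies the stated fractions; the function names and the choice to round down are assumptions.

```python
def initial_sizes(max_messages: int = 2500, avg_string_bytes: int = 128) -> dict:
    """Apply the stated heuristics to derive the initial table sizes."""
    return {
        "message_descriptors": max_messages,
        "threads": max_messages // 3,           # at most 1/3rd the number of messages
        "reply_spillover": max_messages // 6,   # 1/6th the number of messages
        "string_pool_bytes": avg_string_bytes * (max_messages // 2),
    }


def days_until_first_growth(max_messages: int = 2500,
                            messages_per_day: int = 100) -> float:
    """How long the initial message capacity lasts at a steady traffic rate."""
    return max_messages / messages_per_day
```

With the defaults this gives 833 threads, 416 for the reply spillover table, a 160,000-byte string pool, and the 25 days quoted in the text.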

Each mail message in an MLA is assigned a 
unique number that is used for all references (rather 
than file offsets that would need to be updated when 
data structures grow). Likewise each distinct thread of 
discussion is given a unique number that is used for 
all references. String references are stored as offsets 
into the string pool and must be updated if a string is 
moved as a result of the string pool being compacted. 

The two important pieces of information that 
need to be tracked are the mail message contents 
(headers and body) and the message relationships that 
define the discussion threads. Each message 
descriptor contains references to the strings that make 
up the headers; the message body is looked up 
separately using the unique message number assigned 
when the message is added to the database. Message 
descriptors include references to the message’s parent 
message (if it exists), top-level thread (as stored in the 
global thread table), and a list of messages that are 
considered replies to this message. 

Message replies are handled specially. Each 
message may have a single reply that is stored directly 
in the message’s descriptor. However, if multiple 
replies exist then the reply information in the message 
descriptor points to a list of reply blocks that are 
stored in the spillover reply table. Reply blocks are 
always allocated in the reply spillover table in groups 
of four. If all the replies for a message can be stored in 
the available space then they are; however if there are 
more replies than will fit, three replies are recorded 
and the fourth entry in the block is used to reference 
another block in the table where the remaining replies 
are stored. Figure 5 shows an example where a 
message has five replies. 
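
The chaining of reply blocks described above can be sketched as follows. This is an illustration under stated assumptions rather than the MLA implementation: blocks are modelled as Python lists, the tags "reply", "link", and "unused" are hypothetical, and the single in-descriptor reply is left out so that only the spillover case is shown.

```python
BLOCK_SIZE = 4  # reply blocks are allocated in groups of four entries


def build_reply_blocks(replies):
    """Lay out reply message numbers in chained blocks of BLOCK_SIZE entries.

    Each block is a list of ("reply", msgno), ("link", next_block_index)
    or ("unused", None) entries. The tags are hypothetical.
    """
    blocks = []
    remaining = list(replies)
    while remaining:
        if len(remaining) <= BLOCK_SIZE:
            # Everything left fits: no link entry is needed in this block.
            entries = [("reply", m) for m in remaining]
            entries += [("unused", None)] * (BLOCK_SIZE - len(entries))
            remaining = []
        else:
            # Record three replies and use the fourth entry as a link.
            entries = [("reply", m) for m in remaining[:BLOCK_SIZE - 1]]
            entries.append(("link", len(blocks) + 1))
            remaining = remaining[BLOCK_SIZE - 1:]
        blocks.append(entries)
    return blocks


def walk_replies(blocks):
    """Follow the chain from block 0 and collect the reply message numbers."""
    out = []
    index = 0 if blocks else None
    while index is not None:
        next_index = None
        for kind, value in blocks[index]:
            if kind == "reply":
                out.append(value)
            elif kind == "link":
                next_index = value
        index = next_index
    return out


# Five replies (message numbers made up for illustration) need two blocks.
example = build_reply_blocks([32, 33, 35, 40, 41])
print(len(example), walk_replies(example))
```

In this sketch five replies occupy two blocks: three replies plus a link in the first, and two replies plus two unused entries in the second.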

The scheme used to manage replies provides an 
efficient mechanism for handling messages with a 
small number of replies, but can incur noticeable 
overhead when a single message has many replies 
since 25% of each reply block in the spillover table is 
spent on linking to the next block. Various schemes 
and reply block sizes were considered before 
choosing this design based on statistics collected from 
a variety of mail archives. 

[Figure 5 panels: Message descriptor; Spillover reply blocks]

Figure 5. Message reply data structures. Message 31
has 5 replies so the in-descriptor reply points to a block
in the reply spillover table. Two 4-element blocks in the
spillover table are allocated but only five of the six
possible entries are actually filled in/valid.


2.2 Message Parsing and Thread Construction


The MLA tools expect mail messages that 
conform to [6]. When a message is to be added to an 
archive, the header information is extracted and the 
body is stored in the message body database. Only the 
interesting header lines in a mail message are 
retained; all others are discarded. Mail message 
bodies can be stored preformatted as HTML or in 
their original form. Storing messages as HTML is 
most efficient if queries are to be done through the 
World Wide Web, as the overhead of parsing the body 
to insert HTML escapes (e.g., “&lt;” for “<”) and for
recognizing and inserting URLs can be done once 
(rather than each time the message is retrieved for 
display). It is also possible to retrieve message bodies 
with embedded HTML directives removed, but this 
can remove information if the original message 
included HTML directives. 
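
The one-time preformatting step can be illustrated with a short sketch. It is not the MLA code: the function name `preformat_body` and the URL pattern are assumptions chosen for illustration, and a real URL recognizer would need to be more careful (for example about trailing punctuation).

```python
import html
import re

# Assumption: a deliberately simple pattern for things that look like URLs.
URL = re.compile(r'(?:https?|ftp)://[^\s<>"]+')


def preformat_body(text):
    """Escape HTML metacharacters and wrap URLs in anchors, once, at store time."""
    out = []
    pos = 0
    for m in URL.finditer(text):
        out.append(html.escape(text[pos:m.start()]))   # plain text before the URL
        url = html.escape(m.group(0))                  # "&" in a URL becomes "&amp;"
        out.append(f'<a href="{url}">{url}</a>')
        pos = m.end()
    out.append(html.escape(text[pos:]))                # trailing plain text
    return "".join(out)


print(preformat_body("if a < b see http://example.com/x?a=1&b=2 now"))
```

Doing this work when the message is archived means that a later fetch only has to copy the stored body into the reply document.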

The majority of the effort involved in entering a
message into the archive is in establishing the
relationship to other messages; specifically,
determining if a message is a reply to another
message and, most importantly, which one. This
problem is difficult because there are no standard 
mechanisms used to identify the target message to 
which a message is a reply. Reliable reply information 
can usually be found in the “In-reply-to” and 
“References” lines in a header, but not always. It is 
not uncommon for users to use an existing mail 
message as a convenient means for formulating a new 
posting to a mailing list (to avoid having to enter the 
“To” address for the message); this can cause 
anomalous results in formulating message 
relationships. Also, some mail systems do not include 
either type of reply information, leaving only the hint 
of a relationship in the “Subject” line of a message 
(e.g., a leading “Re:” in the subject information). 

Reply information is handled by the following
scheme:

• Look for an “In-reply-to” field in the header.

• If no In-reply-to field is present, use any
  “References” header information.

• If no In-reply-to or References information is
  present, consider the “Subject” line.

• Extract identification information from the
  reply information (In-reply-to, References, or
  Subject). Note that doing this can be
  problematic because the format of these strings
  is not standardized.

• Search the archive for an existing mail message
  with matching identification and, if the parent
  message was identified by something other than
  the Subject line, compare the subject
  information to weed out postings that are false
  relationships (see above).

This scheme can still link messages into a thread in
which they do not belong, but such cases are usually
impossible to detect without actually interpreting the
meaning of the message (e.g., someone uses one
posting to create an unrelated posting and leaves the
subject line the same, although the content of the
posting is totally unrelated to the original).
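
The lookup order described above can be sketched in a few lines. This is a simplified illustration, not the mlaupdate algorithm: the archive is modelled as a list of dictionaries, the names `find_parent` and `normalize_subject` are hypothetical, and several choices (taking the last message id in a field, treating a field without a parseable id as absent, giving up rather than falling back to the subject when an id is not found) are assumptions.

```python
import re

MSGID = re.compile(r"<[^<>\s]+>")


def normalize_subject(subject):
    """Strip leading "Re:" markers and surrounding whitespace, ignoring case."""
    stripped = re.sub(r"^(\s*re:\s*)+", "", subject or "", flags=re.IGNORECASE)
    return stripped.strip().lower()


def find_parent(headers, archive):
    """Return the archived message chosen as parent, or None.

    headers: dict with optional "In-Reply-To", "References", "Subject" keys.
    archive: list of dicts with "Message-ID" and "Subject", oldest first.
    """
    subject = headers.get("Subject", "")
    for field in ("In-Reply-To", "References"):
        ids = MSGID.findall(headers.get(field, ""))
        if not ids:
            continue  # assumption: a field with no parseable id counts as absent
        wanted = ids[-1]  # assumption: the last id names the direct parent
        for msg in reversed(archive):
            if msg["Message-ID"] == wanted:
                # Compare subjects to weed out postings that merely reuse a message.
                if normalize_subject(msg["Subject"]) == normalize_subject(subject):
                    return msg
                return None
        return None
    # Only a hint is left: a leading "Re:" in the subject line.
    if re.match(r"\s*re:", subject, flags=re.IGNORECASE):
        key = normalize_subject(subject)
        for msg in reversed(archive):  # search backwards chronologically
            if normalize_subject(msg["Subject"]) == key:
                return msg
    return None
```

A posting that carries an In-reply-to id but an unrelated subject is rejected here, which corresponds to the case of a user reusing an old message to start a new topic.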


2.3 CGI Support and Navigation 


The MLA tools are designed to be used through the
Common Gateway Interface (CGI) protocol that is
used to connect static URLs to programs that execute
on a server machine. Programs intended for use in this
way all work as backends to query-style HTML forms
(i.e., they take a collection of arguments and return an
HTML document that is the result of applying the
arguments to an MLA). The format and content of the
information returned by a tool is defined by a template
file that contains HTML directives and, optionally,
escape codes that are replaced “on the fly” according
to the contents of an archive or the result of a query.

An interesting aspect of the work done in this
area is how the tools provide good navigational
support in accessing the messages that comprise a
query hit set. The basic problem is this: given a
query, how can the results be made available to clients
across the World Wide Web in such a way that the
context of the query result is maintained across
multiple HTML pages? This problem is difficult
because the underlying Hypertext Transfer
Protocol [7] is stateless and so any context that is to
be maintained across multiple HTTP connections
must be stored in the client, possibly in reference to
some state on the server. In this case the state to
maintain is the result of doing a query of the archive

[Figure 6 content]
Message numbers: 11236 11243 11245 11248 11264 11237 11238 11239 11240 11253 ... 11271 11273 11272 11278 11279 11280 11281
Offsets from 11236: +0 +7 +9 +12 +28 +1 +2 +3 +4 +17 ... +35 +37 +36 +42 +43 +44 +45
Header and deltas: NHits: 47 MinMsg: 11236 BitRange: 6 MsgDeltas: 0, 7, 9, 12, 28, 1, 2, 3, 4, 17, ...
Encoded string: AAMS SmAAACSOBO0SeyJAdIOalDZAc ... XcKdBgUaq2DgSF6UAAAebv

Figure 6. Encoding a hit set. The set of message numbers is encoded as an ASCII string for embedding in URLs. First the set
of numbers is converted to a base value and a set of positive offsets. Then the numbers are encoded as a bit string in which
the offsets are bit-packed. Finally the bit string is converted to ASCII using a restricted alphabet.

database. This state can be stored in the client in one
of several ways:

• as a query that must be redone each time the
  result is needed,
• as a reference to state cached on the server,
• as the result itself.

The first two options might be the same if query 
results are cached with the query string used as a key 
to reference cached results. This has the important 
characteristic that if the cached result on the server is 
no longer available then it can be recreated by redoing 
the query using the supplied key/query string. If query 
strings are large however then this scheme can be 
expensive. Also, caching state on the server potentially
requires unlimited server resources to store
results. Systems such as Alta Vista [11] and
DejaNews [13] appear to cache state on the server.

The last option, storing the query result in the 
client, works well only if the result is small enough 
that it does not cause a large HTML document to be 
generated. There is also the subtle requirement that 
results may be stored in a client for very long periods 
of time (months, maybe years), so references encoded
in the result must work even if the database has 
changed. 

The MLA tools use the third scheme to manage 
query results. The set of message numbers that 
comprise a query hit set are returned to the client in an 
encoded format. This is combined with some careful
usage of the HTML language to minimize the size of
HTML documents that are returned as a result of a
query. 

An encoded hit set is an array of message 
numbers. This information must be returned as an 
ASCII string comprised of a limited set of characters 
that are legal to pass as part of a URL. The set of hits 
is transformed to a 32-bit base message number and a 
set of offsets that are relative to the base message 
number. Offsets are expressed using the minimum 
number of bits required to represent the positive 
values. A fixed header is then prepended to this 
information and the entire result is encoded as ASCII 
using a restricted alphabet. This scheme works 
especially well when queries are constrained to 
reasonable time ranges since message numbers in the 
hit set are then closely packed. Figure 6 shows an
example of this procedure. 
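
A round-trip version of this encoding can be sketched as follows. It illustrates the technique only and does not reproduce the MLA wire format: the header layout (32-bit hit count, 32-bit base, 8-bit offset width), the 64-character alphabet, and the function names are all assumptions, and a non-empty hit set is assumed.

```python
import string

# Assumption: a 64-character URL-safe alphabet, six bits per character.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"


def encode_hits(hits):
    """Encode a non-empty list of message numbers as a URL-safe string."""
    base = min(hits)
    offsets = [h - base for h in hits]
    width = max(max(offsets).bit_length(), 1)   # minimum bits per offset
    # Hypothetical fixed header: hit count, base message number, offset width.
    bits = f"{len(hits):032b}{base:032b}{width:08b}"
    bits += "".join(f"{off:0{width}b}" for off in offsets)
    bits += "0" * (-len(bits) % 6)              # pad to whole characters
    return "".join(ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))


def decode_hits(text):
    """Invert encode_hits, preserving the original order of the hits."""
    bits = "".join(f"{ALPHABET.index(c):06b}" for c in text)
    count = int(bits[0:32], 2)
    base = int(bits[32:64], 2)
    width = int(bits[64:72], 2)
    body = bits[72:]
    return [base + int(body[i * width:(i + 1) * width], 2) for i in range(count)]


hits = [11236, 11243, 11245, 11248, 11264, 11237, 11281]
encoded = encode_hits(hits)
print(encoded, decode_hits(encoded) == hits)
```

Because each offset costs only as many bits as the largest offset requires, hit sets drawn from a narrow range of message numbers stay short, which matches the observation above about queries constrained to reasonable time ranges.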

The encoded hit set is used by programs like 
mlafetch to generate context-sensitive linkage such as 
the next message in a query result; linkage that could 
not otherwise be done without redoing the original 
query. Note that a side-effect of storing the result in 
the client is that if the contents of the database change 
while the client is navigating an HTML page, then the 
navigated view will remain unchanged even if 
redoing the query might yield a different set of hits. 
This is considered good from the standpoint of a user 
interface designer, though some people might prefer 
the alternative. 


3.0 Performance 


3.1 Space Efficiency 


To save space in the archives, the MLA tools do
the following when a new message is entered:

• unneeded mail headers are removed,
• the message body is stored compressed,
• common strings in the header are shared
  through the string pool.

This saved space is offset by the space required to 
store message relationship information and general 
overhead in the TOC database related to efficient 
operation. In addition there is overhead associated
with the hashed database used to store message bodies.

Usage statistics for several archives are shown in
Table 1. The flexfax archive contains messages for
about three years' worth of postings to a fairly quiet
mailing list (about 10 mail messages a day). The 
majordomo archive contains about four years of 
postings to the majordomo users list. The sgi.bad- 
attitude archive holds about four years of news article 
postings to an internal Silicon Graphics newsgroup. 
The www-vrml archive is about a year and a half of 
postings to the www-vrml mailing list. 

Tests were run on a Silicon Graphics Indy 
workstation running IRIX 6.2 with the XFS 
filesystem (a filesystem that supports files with holes 
in them) and version 0.95 of the MLA tools. The 
Table 1: Sample MLA statistics. The sizes of data structures are shown as the number of items in use (Total) versus the
number of items to be filled before the TOC database must be expanded (Max). Spillover files are files created to hold 
message bodies that exceed the maximum size of a message that is stored directly in the message body database. Space 
usage gives the total amount of disk space used by the uncompressed input data (Data) and the final MLA (the TOC 
database, the message body database, and the spillover files). 








hashed message body database files were created with 
version 1.85 of the Berkeley DB software. Other than 
a 3 kilobyte threshold for creating spillover files the 
default parameters were used in creating the archives. 

The implementation uses a 56 byte message 
descriptor and 32-bit numbers for all thread and 
message numbers. The fixed size header at the front 
of the TOC is about 100 bytes. This results in TOC 
files that are very small in comparison to the space 
used to store message bodies; cf. Table 2.

Two specific areas of space usage in the TOC 
were examined: the string pool and the reply spillover 
table. Table 3 summarizes statistics for the sample 
archives related to these two areas. The string pool 
was generally very effective in saving space; on 
average nearly half the space required to store the 
header strings was saved by sharing strings. This 
reflects the fact that most postings have at least one 
reply and so the subject strings in the messages can be 
shared. The more replies there are to a posting, the
more sharing that will take place. Note however that 
optimal sharing requires periodic off-line compaction 
of the string pool since reply messages 
chronologically follow the initial posting so the 
Table 2: Sample MLA space usage. All numbers are
megabytes of disk space allocated (no holes). Input
refers to the raw input data. TOC is the table of
contents database. Message body data is broken up into
the space used by the hashed database file and
separate spillover files.
subject string in the original posting cannot be
compacted until after a reply is received. 

The sample archives examined had very few 
unspecified header lines. As a result there were few 
null strings in the archive so the default string sharing 
for this special case was ineffective. This result would 
likely change if more header information were 
retained in the TOC database. 

The overhead for the linked list scheme for 
storing reply information is more than 30%. That is, 
more than 30% of the space allocated to storing reply 
information is either used for links to following 
blocks in the spillover table or not used at all (i.e. 
unused entries in the last block of a list). This number 
indicates that most reply spillover blocks are fully 
populated since if every block were filled this figure 
would be approximately 25%. While the overhead 
may seem high, the actual amount of storage 
associated with this waste is relatively insignificant in 
the overall totals. An alternate scheme that allocates 
contiguous arrays in the spillover table might be more 
efficient, but would significantly increase the 
complexity of the software. 


[Table 3. Columns: archive, String Compression, Null Strings, Reply Overhead. Rows: flexfax, majordomo, sgi.bad-attitude, www-vrml.]



Table 3: Sample MLA space efficiency. String 
Compression is the ratio of the space required to store 
strings without the string pool to storage with the string 
pool. Null Strings is the count of empty strings. Reply 
Overhead is the percentage of reply entries not used or 
used to link blocks in the spillover table. 


3.2 Update Performance


Table 4 shows results for some sample update 
tests. The tests were all run on an SGI Indy system 
running IRIX 6.2 with the archive stored on an XFS 
filesystem. The machine had a 133 MHz R4600SC
processor with 96 MBytes of memory.

The flexfax, majordomo, and www-vrml archives 
were built from several gzip-compressed files that 
each contained multiple messages. The raw data for 
the sgi.bad-attitude archive had each message in a 
separate file; this is more indicative of normal usage
where mlaupdate processes mail messages as they are 
received. In all test cases the average real time to 
process a message is less than a tenth of a second; this 
is consistent with the goal to keep update time small 
so that lock contention on the MLA is minimized for 
interactive users doing queries. 

All the work that involves the table of contents 
database is done by directly manipulating a memory- 
mapped copy of the TOC file. The most expensive 
work is determining the thread relationship. If the 
message appears to be a reply then it may be
necessary to search backwards chronologically over
message headers to find a message’s parent message.
In normal operation adding a message to an archive
involves touching only a few pages of the TOC file. In
the worst case most of the string pool and message
descriptor table must be touched. The sgi.bad-attitude
test shows the most pronounced effect of this
requirement; many messages in this archive are
replies and the update time is noticeably higher than a 
similar-sized archive such as flexfax. 

As noted in Section 2.1, allocating entries in the 
various TOC tables is typically inexpensive; just 
incrementing a counter in the TOC header. When the 
addition of a message would cause one of these tables 
to overflow, all the tables are expanded in-place; this 
operation can take several seconds depending on the 
characteristics of the underlying system (when there 
is sufficient memory to cache the contents of the TOC 
the grow operation can happen very quickly). 


[Table 4. Columns include archive and Total Time.]

Table 4: Build times for sample MLAs. All
times are in seconds.





One thing to note about the operation of the
mlaupdate program is that it is faster to process
multiple messages together than to add each
individually. This is because mlaupdate builds hash
tables for message header strings to use in
determining message relationships. When mlaupdate
is given one message at a time to add to an archive it
is likely these tables will be built multiple times
(whether or not there is overlap depends on whether
the messages are related).


3.3 Query Performance 


MLA queries are done as a result of URL lookups
that cause the MLA tools to be invoked as CGI
applications. Consequently user perception of query
performance is dominated by the overhead associated
with the HTTP client and server programs and the
underlying network performance.


Table 5 shows the average time for several
exemplary query operations. These numbers were
collected by invoking the MLA programs from the
command line, using the same method by which they
are invoked as CGI applications (i.e. passing
arguments through environment variables). The
results of the query were sent to the null device; e.g.,

    mlaquery -h 1000 >/dev/null


Test runs were repeated 10 times each; the reported
times are the average of these runs. No other queries
were done in between so subsequent queries should
find much of the information in the system buffer
cache. Using a warm cache is representative of normal
usage; if users do repeated queries they are typically
done to narrow the hit set.

The test query operations are:
• 1000 items: return the most recent 1000
  messages in the archive, no constraints are
  applied
• 5000 items: the same as 1000 items, but return
  five times as many messages






Table 5: Sample MLA query tests using the
majordomo archive. All times are in seconds.








USENIX Association 1997 Annual Technical Conference 




• date range: return the first 200 messages
  between 1/1/95 and 1/1/96
• match 1 re: constrain the date range query to
  messages with “BOUNCE” in the subject
• match 2 re: constrain the match 1 re query to
  messages sent by users with “Brent” in their
  name (case sensitive)

Constraining a query to a date range is fast
because the TOC includes a table of messages sorted by
date. Any other constraints require mlaquery to apply 
a regular expression package to header strings that are 
taken from the string pool. This results in an increase 
in the query time that is typically proportional to the 
length of the strings that are checked. Note that 
mlaquery first applies any date range constraint before 
doing regular expression matching so any cheap 
culling of the hit set is done before the more 
expensive pattern matching. 
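
The evaluation order described above (cheap date culling first, regular expressions second) can be sketched as follows. This is an illustration under assumptions, not mlaquery itself: the date table is modelled as a sorted list of (date, message number) pairs, subjects as a dictionary, the date range as half-open, and the function name `query` is hypothetical.

```python
import bisect
import re
from datetime import date


def query(date_index, subjects, start, end, pattern=None, limit=200):
    """Return up to `limit` message numbers dated in [start, end).

    date_index: list of (date, msgno) tuples sorted by date.
    subjects:   dict mapping msgno to its subject string.
    """
    # A one-element tuple sorts before any (date, msgno) pair with the same
    # date, so both bounds are found by binary search on the sorted table.
    lo = bisect.bisect_left(date_index, (start,))
    hi = bisect.bisect_left(date_index, (end,))
    regex = re.compile(pattern) if pattern else None
    hits = []
    for _, msgno in date_index[lo:hi]:
        # The expensive pattern match runs only on the date-culled candidates.
        if regex is None or regex.search(subjects[msgno]):
            hits.append(msgno)
            if len(hits) == limit:
                break
    return hits


index = [(date(1994, 12, 30), 1), (date(1995, 1, 1), 2),
         (date(1995, 6, 1), 3), (date(1996, 1, 1), 4)]
subjects = {1: "BOUNCE a", 2: "BOUNCE b", 3: "hello", 4: "BOUNCE c"}
print(query(index, subjects, date(1995, 1, 1), date(1996, 1, 1), "BOUNCE"))
```

The binary search touches only a logarithmic number of date entries, while the pattern match must read every candidate string, which is consistent with the cost difference reported for the date range and match tests.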

One should note the 5000 items test result; the
time to carry out this query does not scale linearly
with the number of items to return. This is because the
HTML document generated by the query does not grow
in proportion to the hit count: the encoded hit set
returned in the URLs is significantly larger (due to the
encoding scheme). The running time of mlaquery is
dominated not by the time to do the query but instead
by the time to generate the resulting HTML
document. This is an example of the limitation of the
scheme by which query context is stored in the client.


4.0 Summary and Future Work 


The Mailing List Archive (MLA) tools provide 
an efficient system for storing and retrieving mailing 
lists. The ability to access and navigate archives via
the World Wide Web is valuable, providing high-quality
graphical user interfaces essentially for free.
The underlying database design for the table of 
contents and message bodies has worked out well. 
The scheme for storing query context on the client 
side of an HTTP connection has proven successful so 
long as the hit set size is constrained to be reasonably 
small (several hundred). 

The design and implementation of the MLA tools 
was done in early 1994. At that time, they were 
intended mainly to support a few private mailing lists; 
since then they have been used in a variety of settings 
with good results. The two main enhancements that 
have been requested are: support for searching 
multiple archives and searching the body of mail 
messages. 

The first request, to search multiple archives, is
contrary to the original design. In most cases, what
people really want is not the MLA tools but the user
interface and the ease of access that are provided by
the tools.

The second request, for searches over the body of
mail messages, has been addressed in an experimental
version of the tools. Support was added to
automatically construct inverted keyword indices 
from the entire text of each mail message. This new 
functionality permits queries to be formulated that 
search more than just the headers of a mail message. 
However, the indices are larger than desired (with 
only minimal filtering of uninteresting words they 
take up about equal space to the compacted message 
bodies) and compacting them opens up the usual set 
of issues: one can index fewer words by applying 
relevancy information to weed out unimportant 
information, or one can use data structures that
support two-level queries such as those used in 
glimpse [2]. The indexing support was added in a way 
that makes it easy to evaluate alternate indexing 
techniques. This area is the subject of ongoing work. 


5.0 Bibliography 


[1] Kevin Hughes. Hypermail 1.02.
    http://www.eit.com/goodies/software/hypermail/hypermail.html.

[2] Udi Manber, Sun Wu. “GLIMPSE: A Tool to
    Search Through Entire File Systems”. TR 93-34.
    Department of Computer Science, University of
    Arizona. October 1993.
    http://glimpse.cs.arizona.edu:1994/.

[3] The RAND MH Message Handling System. UCI
    Version 6.8.3. October 29, 1995.

[4] NI, Usenet search gateway by Mike Burrows.

[5] Stephen R. van den Berg. Procmail & formail
    mail processing package.
    ftp://ftp.informatik.rwth-aachen.de/pub/packages/procmail/procmail.tar.gz.

[6] D. Crocker, “Standard for the format of ARPA
    Internet text messages”, August 13, 1982.
    Updated by RFC1327, RFC0987.
    ftp://ds.internic.net/rfc/rfc822.txt.

[7] T. Berners-Lee, R. Fielding, H. Nielsen,
    “Hypertext Transfer Protocol -- HTTP/1.0”. May
    17, 1996. ftp://ds.internic.net/rfc/rfc1945.txt.

[8] L. Deutsch, “DEFLATE Compressed Data
    Format Specification version 1.3”. May 23, 1996.
    ftp://ds.internic.net/rfc/rfc1951.txt.

[9] Margo Seltzer, Ozan Yigit. “A New Hashing
    Package for UNIX”. USENIX Technical
    Conference. January 1991. Dallas, Texas. pp.
    173-184.

[10] Jean-loup Gailly, Mark Adler. zlib data
    compression library.
    ftp://ftp.uu.net/pub/archiving/zip/zlib/zlib0.99.tar.gz.

[11] Digital Equipment Corporation. “AltaVista
    Search”. July 21, 1996.
    http://altavista.software.digital.com/products/search/whitpapr/.

[12] NCSA. “The CGI Specification, Version 1.1”.
    http://hoohoo.ncsa.uiuc.edu/cgi/interface.html.

[13] Deja News. http://www.dejanews.com/.


Sam Leffler is a member of the corporate research
group at Silicon Graphics where he works on a variety
of projects. His current focus is on processor scheduling
and operating system support for high performance
multi-threaded applications. Previous work has
spanned the gamut from 3D graphics to computer
networking to programming languages. He received an
M.S. in Computer Science and a B.S. in Mathematics
from Case Western Reserve University, both in 1980.


Melange Tortuba is president of Tortuba Consulting,
a small thinktank located in Berkeley, California. She
spends most of her time researching the needs and
activities of overly-pampered domesticated felines.
Her most well-known accomplishments to date have
been in non-computer fields. She was educated at the
University of California at Santa Barbara where she
majored in mooching attention from students and
faculty.




Experiences with GroupLens: Making Usenet Useful Again* 


Bradley N. Miller 
John T. Riedl 
Joseph A. Konstan 


Department of Computer Science 
University of Minnesota 
email: {bmiller,riedl,konstan}@cs.umn.edu 


Abstract 


Collaborative filtering attempts to alleviate information
overload by offering recommendations on whether
information is valuable based on the opinions of those
who have already evaluated it. Usenet news is an
information source whose value is being severely
diminished by the volume of low-quality and
uninteresting information posted in its newsgroups.
The GroupLens system applies collaborative filtering to
Usenet news to demonstrate how we can restore the
value of Usenet news by sharing our judgements of
articles, with our identities protected by pseudonyms.

This paper extends the original GroupLens work by
reporting on a significantly enhanced system and the
results of a seven week trial with 250 users and over
20,000 news articles. GroupLens has an open and
flexible architecture that allows easy integration of new
newsreader clients and ratings bureaus. We show
ratings and prediction profiles for three newsgroups,
and assess the accuracy of the predictions.


1 The Problem with Usenet Today


1.1 Problem Statement


The information super-highway promises to deliver
more information more rapidly than was ever before
possible. However, many of us are already overwhelmed
with the amount of information we must process each
day. The problem of information overload leaves us
unable to keep up with the information we need. To
long-time readers of Usenet news this problem is
especially evident and has caused many users to
abandon Usenet altogether. How did we get into this
predicament?

*Thanks to AT&T Research for their generous sup-

The Internet was created in the 1970s by a group of
like-minded scientists who used the net primarily to
serve the interests of research and academia. This
community of on-line pioneers thrived for about 20
years until the rush to the net began in the early 1990s.
With this rush came millions of new users with
interests that went way beyond the research questions
that tied the early community together. Not only did
the new users increase the volume of information on
the net, they fundamentally changed the culture. What
once felt like a small community now feels like a loud
impersonal city.

The current estimate of Usenet volume is 21 million
users posting 130,000 articles per day. This is up from
an estimate of 10,000 articles per day in January 1994.
The growth of the Web is even more phenomenal, with
current estimates that its size is doubling every 4
months, in terms of both traffic and the number of
sites.

The GroupLens project seeks to alleviate the problem
of information overload by applying collaborative
filtering techniques to Usenet news and other Internet
resources. In so doing we hope to help restore order to
Usenet, and build a renewed sense of community. In
an earlier paper [10] we reported on the initial
GroupLens architecture and a small scale pilot test
with approximately 12 participants at our local
universities. In this paper we report on the new
GroupLens architecture, including the newly published
open protocols, and a larger scale Internet wide user
test involving more than 250 participants and tens of
thousands of ratings. The next sections describe some
of the past and current strategies for fixing Usenet.
Section 2 presents the GroupLens server architecture.
Section 3 discusses the client library and the
adaptation of newsreaders to support GroupLens.
Section 4 presents data gathered and lessons learned
from the seven week user trial, and section 5 discusses
future work and presents some conclusions.


1.2 Non-collaborative Solutions


Since the early days of Usenet people have tried to
find ways to reduce the number of messages they must
process each day. In this section we consider
non-collaborative solutions, which are solutions that
use only information from a single user. Some of the
earliest non-collaborative techniques relied on
matching keywords in the header fields of the news
articles. Later, more sophisticated keyword matching
techniques emerged that applied techniques from
information retrieval.


Kill files / Score files One of the first methods
introduced to the Usenet community to reduce the
noise level was the killfile. Recently scorefiles have
been introduced as a more general mechanism. A
killfile allows a user to specify certain subjects or
authors that he never wants to see, while a scorefile
allows the user to give interesting subjects and authors
high scores and uninteresting subjects and authors low
scores [5]. The problem with these techniques is their
coarseness. Not all articles containing a desired
keyword are interesting, and even generally poor
writers occasionally produce an article worth reading.
Additionally, keywords are difficult to identify in the
presence of aliases, synonyms and misspelled words.


Moderated Newsgroups Another approach to
reducing the noise level on Usenet is the creation of
moderated newsgroups. In a moderated newsgroup one
person, the moderator, must approve each article
before it is distributed throughout Usenet. The
moderator is responsible for rejecting articles that are
off-topic, inflammatory, or generally of poor quality.
The problem with moderated groups is that they
require a large time commitment from the moderator,
and the quality judgment is left up to a single person.


Programmable Agents Programmable agents are
simple programs that perform actions on behalf of
users. Programmable agents have been used in
information filtering to prioritize messages, gather
messages into folders by keyword, or even reply to
messages. For instance, the Information Lens system
enables even unsophisticated users to automatically
perform actions in response to messages [6]. Object
Lens extends the Information Lens to other domains,
including databases and hypertext [3].


Intelligent Agents The July 1994 issue of 
Communications of the ACM is devoted to the 
state of the art in intelligent agent research. 
This issue includes discussions of several agents 
designed to reduce information overload. [4]. 
The agents address meeting scheduling, email 
handling, and netnews filtering. The netnews 
agent is known as NewT [14]. A user trains 
NewT by showing it examples of articles that 
should and should not be selected. The agent 
performs a full text analysis of the article using 
the vector-space model [11]. Once the agent 
has gone through initial training it starts mak- 
ing recommendations to, and accepting feed- 
back from the user. Based on user feedback 
NewT is able to make weighted judgments 
about news articles containing keywords.
Intelligent agents for information filtering suffer
from the same drawbacks as keyword-based
techniques. An additional problem is that 
agents must be trained. Norman points out 
in [9] that interaction with and instruction of 
agents is a difficult problem that has not been 
solved satisfactorily. 


1.3 Collaborative Solutions


Collaborative filtering systems make use of the 
reactions and opinions of people that have al- 
ready seen a piece of information to make pre- 
dictions about the value of that piece of infor- 
mation for people who have not yet seen it. 
Collaborative filtering is already used heavily 
in informal ways. Users regularly forward arti- 
cles, or references to articles, to their friends 
and colleagues with the explicit or implicit 
message: “You will like this.” Collaborative
filtering systems attempt to formalize this process
to more effectively incorporate a larger set
of users and data. The utility of collaborative 
filtering extends beyond the domain of Usenet 
news into the realm of movies, videos [2], and 
audio CDs [13]. 

Usenet represents a uniquely challenging 
problem for a collaborative filtering system be- 
cause of the sheer volume of information items 
for which ratings must be collected and pre- 
dictions calculated. The 130,000 new messages 
produced on Usenet each day dwarf the number
of new CDs and movies produced in an
entire year. 


Tapestry The Tapestry system [1] is an 
early collaborative filtering system designed 
to help small groups of people work together 
to solve the information overload problem. 
Tapestry makes sophisticated use of subjective 
evaluations. It allows filtering of all incom- 
ing information streams, including email and 
Usenet news. Many people can post evalu- 
ations and users can choose which evaluators 
to pay attention to. The evaluations can con- 
tain text, not just a numeric rating or boolean 
accept/reject. In the Tapestry system users 
can combine keyword criteria with subjective
criteria to form requests. An example
request might be “Give me all the articles
containing the word collaborative that Pat has
evaluated and where the evaluation contains
the word excellent.” Tapestry works well in a
close-knit community with common interests.
GroupLens extends the concepts in the Tapestry
system in two ways. First, GroupLens provides
predictions based on the aggregation of
ratings entered by other users. Second, GroupLens
does not require the user to know in advance
whose evaluations to use.


NoCeM NoCeM (no see ’em) is a system 
that makes it possible for anyone to attempt 
to cancel an article that is widely cross-posted 
or seen as a blatantly commercial posting. Un- 
der the NoCeM model, any person on the net 
who sees something they think shouldn’t have 
been posted can issue a NoCeM notice. How- 
ever, just as with any other type of Usenet 
message, the weight the notice carries will be 
no greater than the poster’s net.reputation. If 
people agree with the issuer’s criteria and also 
feel that this person is a good judge of that 
standard then they will accept his/her notices. 
When a NoCeM notice is accepted by a user 
it will typically mark the message as read in 
the user’s newsrc file. NoCeM notices could 
instead be used to remove the message from 
the local spool, thus keeping all users on the 
local system from seeing the article. 


1.4 The GroupLens Approach 


GroupLens is a collaborative filtering system 
for Usenet news. The aim of GroupLens is 
to help people work together to find articles 
they will like in the huge stream of available 
Usenet articles. In effect, GroupLens auto- 
matically selects for you a group of people to 
act as your personal moderators for a given 
newsgroup. These moderators are selected by 
finding people with whom you have had sub- 
stantial agreement on past articles. Users can 
ensure their privacy by entering ratings under 
pseudonyms without reducing the effectiveness 
of the predictions. 

Usenet news readers can take advantage of
GroupLens by reading news with a
GroupLens-aware news client. We provide several clients
and make a library available for newsreader 
authors who want to make their newsreaders 
GroupLens-aware. An example client is shown 
in figure 1. The newsreader connects to the 
user’s local NNTP server to retrieve Usenet 
news articles, and it also connects with the 
GroupLens server to share filtering informa- 
tion. Whenever the user fetches articles from 
a newsgroup, the news reader sends a message 
to the GroupLens server requesting predictions 
of how the user will value each article. Those 
predictions are displayed alongside the article 
titles as a bar. When reading news articles, 
the user may enter ratings of the actual value of 
each article. Those ratings are sent back to the 
GroupLens server to serve as input for other 
users’ predictions and to update the correla- 
tions between this user and other users. The 
more one uses the system, the more data there 
is upon which to base predictions. 

We believe that GroupLens provides the best 
opportunity for managing the overwhelming 
amount of data in Usenet news. In addi- 
tion to the general benefits of collaborative fil- 
tering over non-collaborative solutions, Grou- 
pLens has a scalable open architecture that can 
support a large number of users and data ele- 
ments. The GroupLens architecture also can 
support a variety of algorithms for collabo- 
rative filtering, allowing system designers to 
trade off between efficiency of calculation, stor- 
age requirements, and the degree of personal- 
ization of predictions. 


2 The GroupLens Architecture


2.1 Overview 


At the heart of GroupLens lies the GroupLens 
Ratings Bureau (GLRB) which functions as a 
request broker for the distributed collaborative 
filtering engine. To implement the request bro- 
ker we have adopted a “process pool” model 
that allows the GLRB to identify incoming requests
and hand them off to the appropriate
background daemon. To keep the number of 
network connections low we have adopted a vir- 
tual session model for each client connection. 
Once a client has logged in, the client is given a
session token that is valid until the client logs 
out or becomes inactive. 

In figure 2 we see two newsreading clients in 
different stages of communication with GroupLens.
The tin client is in the process of establishing
a connection with the GLRB; its request
has not yet been determined. On the
other hand the GLRB has already assigned the 
xrn client to the appropriate prediction dae- 
mon. 

The filtering engine contains four primary
modules: a prediction module, a ratings module,
a correlation module, and a data management
module. The GroupLens architecture is
open, and the modules communicate with each
other through a well-defined protocol. This allows
any of the modules to be replaced by a
functionally equivalent module so long as the 
new one conforms to the protocol. 

The primary goal of the collaborative filter- 
ing engine is to provide clients with accurate 
predictions quickly. Predictions are calculated 
by one of the daemons in the prediction dae- 
mon pool. To calculate a prediction for an arti- 
cle the prediction daemon requires two inputs: 
a measure of similarity between pairs of users, 
and ratings for the article in question. The 
way in which these two inputs are combined 
for each prediction is described in [13, 10, 7]. 
Our performance goal for the prediction algorithm
is to be able to calculate and deliver
100 predictions in less than two seconds. In
practice we are able to deliver 100 predictions
in 4.2 seconds on a Sparc 5 workstation with
32 MB of memory, running Solaris 2.4.

Pairwise similarity between users is deter- 
mined by the correlation program. Similarity 
between two users is determined by how they 
have rated articles in the past. Because users’
correlations change relatively slowly, and because
newsreading tends to be a daily activity,
the correlation program is run once a day. 

Figure 2: GroupLens Architecture Overview

Ratings for Usenet articles are received in
batches from the newsreader client as defined
by the GroupLens protocol. It is the responsi- 
bility of the ratings daemon to receive ratings 
from the clients as fast as possible and to en- 
sure that the ratings are eventually stored in 
the ratings database. Our performance goal 
for ratings is to be able to accept 100 ratings 
in less than 1 second. In practice we are able 
to accept 100 ratings in less than 0.5 seconds. 

The data management subsystem is respon- 
sible for maintaining both the ratings and cor- 
relations databases. Logically you can think 
of the ratings database as a large matrix orga- 
nized with message-ids indexing the columns, 
and pseudonyms indexing the rows. The cur- 
rent database for the rec.humor newsgroup has 
20,837 columns and 74 rows for a total of 
1,541,938 cells. However, only 30,537, or about 2%,
of these cells are occupied. We have devised a 
storage mechanism that minimizes access time 
for an individual cell, and minimizes the space 
required to store the matrix. 
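
The storage mechanism itself is not described here, so the following is only a minimal sketch of the general idea: keep the occupied cells and nothing else. The dictionary keyed by (pseudonym, message-id) pairs, the class name, and the sample pseudonyms are all assumptions for illustration.

```python
class SparseRatings:
    """Ratings matrix that stores only occupied cells.

    Rows are pseudonyms and columns are message-ids, as in the text;
    the dict-of-pairs layout is an assumption for illustration only.
    """

    def __init__(self):
        self._cells = {}       # (pseudonym, message_id) -> rating
        self._rows = set()
        self._columns = set()

    def put(self, pseudonym, message_id, rating):
        self._cells[(pseudonym, message_id)] = rating
        self._rows.add(pseudonym)
        self._columns.add(message_id)

    def get(self, pseudonym, message_id):
        """Return the rating, or None for an empty cell."""
        return self._cells.get((pseudonym, message_id))

    def occupancy(self):
        """Fraction of the logical matrix that is actually filled."""
        total = len(self._rows) * len(self._columns)
        return len(self._cells) / total if total else 0.0


ratings = SparseRatings()
ratings.put("quiet-otter", "<1@example.com>", 4)
ratings.put("quiet-otter", "<2@example.com>", 1)
ratings.put("red-kite", "<2@example.com>", 2)

# The rec.humor figures quoted above: 74 rows by 20,837 columns.
rec_humor_cells = 74 * 20837
rec_humor_occupancy = 30537 / rec_humor_cells
```

A hash lookup gives constant expected access time per cell, and memory grows with the number of ratings rather than with the size of the logical matrix.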

We have designed our data management in- 
terface so that we can plug in one of several 
DBMSs on the back end, while maintaining a 
consistent interface on the front. Currently we 
support three back ends: gdbm [8], Illustra, 
and OBST. 


2.2 Protocol 


The glue that ties all of the modules together, 
and allows the newsreading clients to talk to 
the GLRB and other modules, is the Grou- 
pLens protocol. The protocol consists of five 
major commands. We'll give a brief overview 
of the major commands here. The details of 
the protocol are available on-line [15]. 

Three key concepts in the protocol are
pseudonyms, tokens, and message-ids.
Pseudonyms are the secret identifiers selected
by users to identify themselves to the
GroupLens system while maintaining their
privacy. Tokens are integers returned from
the server to represent the state of a logged-in
user. The server maintains just enough
state for each token so the user does not
have to authenticate herself each time she
requests predictions or submits ratings. The
server discards tokens after a timeout period. 
Message-ids are used by the clients to identify 
items they wish to rate or get predictions 
for. Message-ids in the Usenet trial are the 
standard Usenet message identifiers used by 
news clients to identify messages to the news 
server. 


Register In figure 2 we see a World Wide 
Web client talking to the GLRB from the 
GroupLens Registration page. The for- 
mat of a register command is register 
pseudonym. When the GLRB receives 
a register command, it checks the user 
database to make sure that the given 
pseudonym does not already exist. 


Login The format of the login command is 
login pseudonym. When a login request 
is received the GLRB checks to see if the 
pseudonym is valid. If the pseudonym is 
valid the client is given a session token. To 
increase security a password may be op- 
tionally supplied as part of the login com- 
mand. 


Logout The format of the logout command is 
logout token. When a logout command 
is received the token is removed from the 
list of active tokens, and the token number 
is invalidated. Any future requests using 
this token number will be refused. 


GetPredictions The format of the getpredic- 
tions command is getpredictions token 
newsgroup, followed by a list of Usenet 
message-ids. When the GLRB sees that 
the request is getpredictions it validates 
the token and newsgroup name, and then 
passes the request to a free prediction dae- 
mon. The prediction daemon reads the list 
of message-ids and returns either a predic- 
tion or a keyword indicating no prediction 
for each message-id. 


PutRatings The format of the putrat- 
ings command is putratings token 
newsgroup, followed by a list of tuples
that contain the message-id and rating.
The ratings daemon simply reads the list 
of tuples and informs the client that it 
has received them. When it has time, 
the ratings daemon writes the list to the 
database. 
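
To make the flow of the five commands concrete, here is a toy in-memory stand-in for the GLRB. It is a sketch under stated assumptions, not the real server: the reply strings ("ok", "error ...", "nopred"), the timeout value, and the class and method names are invented for this example, and predictions are a plain average of other users' ratings rather than the correlation-based algorithm.

```python
import itertools
import time


class ToyRatingsBureau:
    """In-memory stand-in for the GLRB; not the real server."""

    def __init__(self, timeout_seconds=1800.0, clock=time.monotonic):
        self._clock = clock
        self._timeout = timeout_seconds   # assumed value
        self._pseudonyms = set()
        self._tokens = {}                 # token -> (pseudonym, last_used)
        self._next_token = itertools.count(1)
        self._ratings = {}                # (newsgroup, message_id) -> {pseudonym: rating}

    def register(self, pseudonym):
        if pseudonym in self._pseudonyms:
            return "error pseudonym-exists"
        self._pseudonyms.add(pseudonym)
        return "ok"

    def login(self, pseudonym):
        if pseudonym not in self._pseudonyms:
            return "error unknown-pseudonym"
        token = next(self._next_token)
        self._tokens[token] = (pseudonym, self._clock())
        return f"ok {token}"

    def logout(self, token):
        removed = self._tokens.pop(token, None)
        return "ok" if removed else "error bad-token"

    def _user_for(self, token):
        """Validate a token, discarding it if it has timed out."""
        entry = self._tokens.get(token)
        if entry is None:
            return None
        pseudonym, last_used = entry
        now = self._clock()
        if now - last_used > self._timeout:
            del self._tokens[token]
            return None
        self._tokens[token] = (pseudonym, now)
        return pseudonym

    def putratings(self, token, newsgroup, pairs):
        user = self._user_for(token)
        if user is None:
            return "error bad-token"
        for message_id, rating in pairs:
            self._ratings.setdefault((newsgroup, message_id), {})[user] = rating
        return f"ok {len(pairs)}"

    def getpredictions(self, token, newsgroup, message_ids):
        user = self._user_for(token)
        if user is None:
            return "error bad-token"
        lines = []
        for message_id in message_ids:
            by_user = self._ratings.get((newsgroup, message_id), {})
            others = [r for name, r in by_user.items() if name != user]
            if others:
                # Placeholder predictor: plain average of other users' ratings.
                lines.append(f"{message_id} {sum(others) / len(others):.2f}")
            else:
                lines.append(f"{message_id} nopred")
        return lines


bureau = ToyRatingsBureau()
bureau.register("quiet-otter")
bureau.register("red-kite")
otter = int(bureau.login("quiet-otter").split()[1])
kite = int(bureau.login("red-kite").split()[1])
bureau.putratings(otter, "rec.humor", [("<1@example.com>", 4)])
predictions = bureau.getpredictions(
    kite, "rec.humor", ["<1@example.com>", "<2@example.com>"])
bureau.logout(kite)
```

Passing the clock in as a parameter is a design choice for the sketch: it lets the token timeout be exercised without waiting.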


2.3 Privacy


Privacy is an important issue in a large-scale
collaborative filtering system. There are three 
ways to handle user privacy issues in a col- 
laborative filtering system. First, users may 
be anonymous so that ratings are submitted 
without any user identification. When ratings 
are submitted anonymously, the only opera- 
tions the system can perform are aggregate op- 
erations such as the average rating [7]. Sec- 
ond, users may be known to all other users. 
In this case the ratings are closely associated 
with the reputation of the rater, and users 
seeking recommendations or predictions may 
specify which other users to use in generat- 
ing predictions [1]. This option requires users 
to give up the privacy of their ratings. The 
third option, employed by GroupLens, uses 
pseudonyms to uniquely identify every user. 
Using pseudonyms allows ratings to be asso- 
ciated with a user, and allows predictions to 
be customized for users based on their corre- 
lation with other pseudonyms. In GroupLens, 
pseudonyms and their associated ratings are 
publicly available. However, we do not as- 
sociate these pseudonyms with a user’s real 
identity and we use an authentication protocol 
[12] that prevents a user from using another’s 
pseudonym. 


3 Filtering Clients 


3.1 News Readers 


The primary user interface for the GroupLens 
system is a set of newsreaders that are adapted 
to use the GroupLens server as well as the lo- 
cal NNTP server. We currently support three 
Unix-based newsreaders: xrn, tin, and gnus.
We are in the process of adapting newsreaders
for the PC and Macintosh platforms. 

To simplify the process of adapting news- 
readers, we have implemented and freely pro- 
vide a GroupLens client library. This library 
handles all GroupLens server communication, 
manages local configuration files, and provides 
data structures to simplify the integration of 
ratings and predictions into an existing news 
reader. When using the library, the newsreader 
author is freed from the details of logging into 
the GroupLens server and maintaining a token, 
and from the details of the GroupLens proto- 
col. 

Adapting a newsreader to use GroupLens in- 
volves three steps: 


1. Passing the set of article IDs to the client 
library (to retrieve predictions) when re- 
trieving article headers for a newsgroup. 


2. Recording article information, including
user ratings, in the GroupLens article table,
and calling the library routine to submit
these ratings after finishing each newsgroup.


3. Defining a user interface for displaying 
predictions (some support is in the li- 
brary) and for receiving ratings from 
users. 


Our experience modifying newsreaders has 
shown that the interface changes are the hard- 
est part of the process. Many newsreaders use 
nearly every key on the keyboard, and con- 
sequently require creative interface design to 
maintain a consistent interface. While dis- 
playing predictions was somewhat simpler, we 
have found that some news reading models 
are not as amenable to selection by title and 
prediction, and accordingly plan to investigate 
methods for providing summary predictions for 
threads (perhaps at the client interface). 

We were able to add GroupLens support to 
xrn with less than 1000 additional lines of code. 
These lines represent less than 3% of the total 
xrn source code.


3.2 Filter Bots


We use the name filter-bot to refer to simple 
filter programs that algorithmically (“roboti- 
cally”) supply useful information to a filtering 
system. In GroupLens, a filter-bot is a pro- 
gram that assigns a rating to a Usenet arti- 
cle based on some simple computable criteria. 
Filter-bots are implemented using the client li- 
brary and enter ratings under their own unique 
pseudonyms. This allows real users to weigh 
the ratings of the filter-bots along with rat- 
ings from other users. Examples of filter-bots 
we have implemented include an article length 
filter-bot and an excessive quoted text filter-bot.
We intend to implement several more, including
a reading level filter-bot¹, a prolific author
filter-bot, and an excessive cross-posting
filter-bot.

Filter-bots measure syntactic features of ar- 
ticles, providing additional ratings with which 
users can correlate, potentially improving pre- 
dictions. Filter-bots facilitate incorporating 
new information filtering algorithms into the 
GroupLens architecture. Filter-bots also miti- 
gate the first rater problem, which stems from 
the fact that in order to compute a prediction 
for an article at least one previous rating must 
be available. Filter-bot ratings for an article 
can be computed immediately, so they are al- 
ways available for users. 
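
As a rough sketch of what such filter-bots might compute, the two implemented examples can be approximated as below. The thresholds, the mapping onto the 1 to 5 scale, the bot pseudonyms, and the function names are assumptions for illustration; the actual filter-bots submit their ratings through the client library, which is not modeled here.

```python
def quoted_fraction(body):
    """Fraction of non-blank lines that begin with the '>' quote marker."""
    lines = [line for line in body.splitlines() if line.strip()]
    if not lines:
        return 0.0
    quoted = sum(1 for line in lines if line.lstrip().startswith(">"))
    return quoted / len(lines)


def excessive_quote_bot(body):
    """Map the quoted fraction onto the 1-5 scale (thresholds are assumed)."""
    fraction = quoted_fraction(body)
    if fraction > 0.8:
        return 1
    if fraction > 0.5:
        return 2
    return 3   # neutral: this bot never claims an article is good


def article_length_bot(body, max_lines=200):
    """Rate very long articles low; the cutoff is an assumed value."""
    return 2 if len(body.splitlines()) > max_lines else 3


# Each bot rates under its own pseudonym, like any other user.
BOTS = {"bot-quoted-text": excessive_quote_bot, "bot-length": article_length_bot}


def bot_ratings(message_id, body):
    """One (pseudonym, message-id, rating) tuple per bot."""
    return [(name, message_id, bot(body)) for name, bot in BOTS.items()]


sample = "> a\n> b\n> c\n> d\n> e\nMe too!"
sample_ratings = bot_ratings("<3@example.com>", sample)
```

Because the bots need only the article text, their ratings can be produced as soon as an article arrives, before any human has rated it.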


¹ A reading level filter-bot rates articles according to
the minimum grade level for which the vocabulary and
sentence structure would be appropriate.


4 Experiences


We now turn to the results of a user trial
we conducted to test the GroupLens architecture.
The user trial began February 8,
1996 when we posted an announcement to
the comp.os.linux.announce newsgroup. In
order to participate in the trial, users had
to be willing to use one of the newsreaders
that had been enhanced with GroupLens
support. The newsreaders available were
gnus-5.1 (for emacs), tin, and xrn. Participants
also had to be willing to read and
rate articles in one of our supported newsgroups.
The groups supported for the trial
included the entire comp.os.linux hierarchy, 
rec.humor, rec.food.recipes, comp.lang.c++, 
comp.lang.java, and rec.arts.movies.current- 
films. Participants were told to rate articles 
according to the following definitions: 


1. This article is really bad! a waste of
net.bandwidth.
2. This article is bad. 
3. This article is neither good nor bad. 
4. This article is good. 


5. This article is great, I would like to see 
more like it. 


Through the first seven weeks of the trial, 
we had 250 users register to use GroupLens. 
The total number of ratings received during 
the seven week period was 47,569. These rat- 
ings are spread over a total of 22,862 distinct 
messages. Over the same seven week period 
GroupLens provided over 600,000 predictions 
to users. This ratio of ratings to predictions
(fewer than one rating for every 12 predictions)
is appropriate for a noisy domain like Usenet 
news, since it suggests that ratings help users 
choose which items to review. 

In the next two sections we take a look at the 
general question, “does it work?” We'll look at 
this question from two perspectives: First, do 
the predictions accurately reflect what the user 
rated the article? Second, do users find the pre- 
dictions useful, and do they believe them? We 
used data collected from the earlier GroupLens 
trial [10] to develop prediction algorithms, and 
evaluated the performance of the prediction al- 
gorithms based on the data from the present 
trial. The algorithms were not changed during 
this trial. 


4.1 Accuracy of Predictions 


Our experience has shown that the prediction
program behaves differently for different
newsgroups. To study this point, we
will examine the accuracy of the prediction
program for three representative newsgroups:
rec.humor, rec.food.recipes, and
comp.os.linux.development.system.

For each of the newsgroups we will compare
the accuracy of predictions for two different
ways of calculating the predictions. We
will calculate a personalized prediction for each
user for each article using the Pearson coefficient
as a similarity measure between users, as
described in [10]. For comparison we will cal- 
culate the average rating entered for each ar- 
ticle. We compare each prediction against the 
actual ratings entered by the users. It is useful 
to look at the average because it is fast to cal- 
culate and requires very little storage. On the 
other hand, the average does not allow for any 
personalization of the ratings. 
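
The two kinds of prediction can be contrasted in a short sketch. It assumes one common Pearson-weighted formulation: the user's own mean rating plus the correlation-weighted deviations of other raters from their own means. The sample data, the function names, and the choice to take each mean over all of a user's ratings are assumptions for illustration; the algorithm actually used is the one described in the cited papers and may differ in detail.

```python
from math import sqrt


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the articles both users have rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    xs = [ratings_a[m] for m in common]
    ys = [ratings_b[m] for m in common]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0   # undefined correlation treated as "no information"
    return cov / sqrt(var_x * var_y)


def average_prediction(all_ratings, user, message_id):
    """Non-personalized baseline: mean of everyone else's rating."""
    others = [r[message_id] for name, r in all_ratings.items()
              if name != user and message_id in r]
    return sum(others) / len(others) if others else None


def personalized_prediction(all_ratings, user, message_id):
    """User's mean plus correlation-weighted deviations of other raters."""
    mine = all_ratings[user]
    my_mean = sum(mine.values()) / len(mine)
    numerator = denominator = 0.0
    for name, theirs in all_ratings.items():
        if name == user or message_id not in theirs:
            continue
        weight = pearson(mine, theirs)
        their_mean = sum(theirs.values()) / len(theirs)
        numerator += weight * (theirs[message_id] - their_mean)
        denominator += abs(weight)
    if denominator == 0:
        return None
    return my_mean + numerator / denominator


all_ratings = {
    "ann": {"m1": 1, "m2": 5, "m3": 3},
    "bob": {"m1": 1, "m2": 5, "m3": 3, "m4": 5},   # agrees with ann
    "cat": {"m1": 5, "m2": 1, "m3": 3, "m4": 1},   # disagrees with ann
}
personal = personalized_prediction(all_ratings, "ann", "m4")
average = average_prediction(all_ratings, "ann", "m4")
```

In this toy data the average lands in the middle of the scale because the two raters cancel out, while the personalized prediction follows the rater who has agreed with ann in the past and treats the disagreeing rater's low rating as evidence in the opposite direction.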

The metrics we will use to measure the accuracy
of the algorithms include the mean
squared error $\overline{E^2}$; the mean absolute error $\overline{|E|}$;
the standard deviation, $\sigma$, of $|E|$; and the Pearson
correlation coefficient $r$ between ratings
and predictions.

To measure the mean absolute error we take 
the absolute value of the difference between 
the actual rating entered by the user and the 
prediction computed by the algorithm for each 
rating/prediction pair, and compute the mean 
of all of the differences. The lower the mean 
absolute error, the better the algorithm. The 
standard deviation of the error is a measure 
of how consistently accurate the algorithm is. 
One problem with using the mean absolute 
error is that it does not sufficiently penal- 
ize algorithms that make large errors. The 
mean squared error, like least squares regression,
penalizes large errors disproportionately
more than small ones. We
want to penalize large errors because users 
probably don’t distinguish between a predic- 
tion of 1.5 and 2.0. On the other hand, users 
will notice if an algorithm predicts something 
to be a 4.0 that should really be 2.0. 
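
The four metrics can be computed as in the sketch below. The function name, the sample rating/prediction pairs, and the use of the population (divide by n) standard deviation are assumptions; the text does not say which form of the standard deviation was used.

```python
from math import sqrt


def accuracy_metrics(pairs):
    """Compute the metrics for a list of (actual_rating, prediction) pairs."""
    errors = [abs(actual - predicted) for actual, predicted in pairs]
    n = len(errors)
    mean_abs = sum(errors) / n
    mean_sq = sum(e * e for e in errors) / n
    # Population standard deviation of |E| (dividing by n is an assumption).
    std_abs = sqrt(sum((e - mean_abs) ** 2 for e in errors) / n)
    actuals = [a for a, _ in pairs]
    predictions = [p for _, p in pairs]
    mean_a, mean_p = sum(actuals) / n, sum(predictions) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in pairs)
    var_a = sum((a - mean_a) ** 2 for a in actuals)
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    # r is undefined when either series is constant.
    r = cov / sqrt(var_a * var_p) if var_a and var_p else None
    return {"mean_squared": mean_sq, "mean_absolute": mean_abs,
            "std_absolute": std_abs, "pearson_r": r}


# Two made-up algorithms with the same mean absolute error.
steady = [(2, 3), (4, 3), (1, 2), (5, 4)]    # always off by one
erratic = [(2, 2), (4, 4), (1, 1), (5, 1)]   # usually exact, once off by four
all_ones = [(1, 1), (2, 1), (1, 1), (4, 1)]  # constant prediction of 1

steady_metrics = accuracy_metrics(steady)
erratic_metrics = accuracy_metrics(erratic)
all_ones_metrics = accuracy_metrics(all_ones)
```

Both made-up algorithms have a mean absolute error of 1, but the erratic one has four times the mean squared error, which is the distinction drawn above. A constant predictor has no defined correlation with the actual ratings, so the sketch reports None for it.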

As mentioned in section 2, the prediction al- 
gorithm requires a measure of similarity be- 
tween pairs of users (correlation), and ratings. 
The nature of the newsgroup appears to have
an effect on both of these factors. In figure 3 we
show the rating profiles for all groups combined,
and for the three newsgroups.


1997 Annual Technical Conference

In rec.humor, 83% of the ratings are 1 or 
2. This reflects the paucity of funny articles 
and the overabundance of name-calling, flam- 
ing, and completely silly discussions of World 
War II. Rec.humor is a good example of a newsgroup
where there is a clear metric for determining
a rating: “Is it funny?” The fact that
there is a clear metric for judging each article,
and the fact that there is so much noise, lead
to a high level of correlation between pairs of 
users. This is illustrated in figure 4 where we 
can see that most pairs of users have a high 
positive correlation. 


method          E^2     |E|     sigma   r
average         [?]     0.63    0.88    0.49
personalized    0.94    0.67    0.68    0.62
all-ones        2.01    0.78    1[?]    [?]

Table 1: Summary of Results in Rec.humor


In table 1 we see the comparison be- 
tween average and personalized predictions for 
rec.humor. Because there is such a high de- 
gree of correlation between users, we see that 
the average is slightly better than the person- 
alized algorithm in terms of the mean absolute 
error. One might think that given the ratings 
profile for rec.humor the best strategy to min- 
imize error would be to simply predict 1 for 
every article. The row called “all-ones” in ta- 
ble 1 shows that this is not a good strategy 
after all. 

In comp.os.linux.development.system
and rec.food.recipes we see that the ratings 
are more evenly distributed (see figure 3). 
Rec.food.recipes is a moderated newsgroup, 
so all of the posts are on topic, and there is no 
name calling or spamming. In addition users 
once again have a clear metric for rating an 
article: “Would I like to cook this?” However, 
as figure 4 shows, users in rec.food.recipes 
have a lower correlation. The reason for this 
is that ratings are based literally on taste. For
example, two users may agree all the time on
dessert recipes, but one may be a vegetarian
who rates all recipes with meat low, and the 
other may be a carnivore who rates recipes 
with meat high. 

In table 2 we summarize the results for 
rec.food.recipes. We see that the errors 
in this group are uniformly higher than for 
rec.humor. This can be attributed to the fact 
that the correlation between users is low for 
this newsgroup. However, we do see that the 
personalized predictions are better than aver- 
age predictions for all of our error metrics. 


Table 2: Summary of Results in Rec.food.recipes


Determining what to rate an article in 
comp.os.linux.development.system is more
difficult than in either of the two previous newsgroups.
When rating an article in this group a
user must weigh several factors: 


• Is the article appropriate for this newsgroup?

• Is the topic of the article interesting to me?

• Is the article well written?

• Is the article factually correct?


Despite all of these factors the readers of 
comp.os.linux.development.system have a 
high degree of correlation. This may be because
the early adopters of the GroupLens system
are all likely to be fairly sophisticated Linux
users. In table 3 we see that the personalized 
predictions are again more accurate than the 
average. 


Table 3: Summary of Results in
Comp.os.linux.development.system


4.2 Effect of Predictions on Users


We'll now look at what effect, if any, the predictions
have on a user’s likelihood to read and
rate a message. One result is clear: users are
more likely to rate messages if they see that 
they are also getting predictions. Further, our 
analysis of the rating patterns of users shows 
that users are almost twice as likely to rate an 
article for which they see a prediction greater 
than three than they are for an article with a 
prediction of three or less. 


4.3 Bootstrapping 


One rather important lesson that we have 
learned over the course of this project is the 
difficulty of bootstrapping collaborative filter- 
ing in Usenet newsgroups. While we expected 
some inertia among users, and in particular 
recognized the difficulty in asking users to up- 
grade or change newsreaders, we were surprised 
by some of the social difficulties involved in 
bringing collaborative filtering to Usenet news. 

Collaborative filtering must have users to be 
useful. In fact we believe that a collaborative 
filtering system has built-in incentives that encourage
more people to participate. To ensure
value for trial users, we made an effort to add 
GroupLens support for newsgroups slowly, and 
only after ensuring that we had at least one or 
two active reader/raters who would keep the 
group going. We would then post a message 
to the newsgroup itself, inviting others to use 
GroupLens and pointing them to our web page 
for registration and software. It was here that 
we ran into a very tricky bootstrapping prob- 
lem. 

Discussing GroupLens was off-topic for al- 
most every newsgroup we encountered. Ac- 
cordingly, our messages were often ignored, 
and there was no follow-up discussion within 
the group. In retrospect, this is not surprising. 
A posting about a better way to read news
is not funny (and therefore does not belong
in rec.humor), it is not about Linux software
and development (and therefore does not be- 
long in comp.os.linux.*), and similarly is out 
of place in the very newsgroups where we ex- 
pected it to be useful. Few Usenet news readers 
choose to read newsgroups about reading news. 

We did get a user base, albeit more slowly 
than we would have liked. We have since rec- 
ognized more effective ways of bootstrapping: 
working more closely with newsreader main- 
tainers to become part of the standard distri- 
bution, providing limited service for a wider 
range of newsgroups to create more general in- 
terest, and direct promotion through demon- 
strations and other publicity. 


5 Conclusions and Future Work


The GroupLens trial has demonstrated the 
efficacy of collaborative filtering for Usenet 
news. We have learned about the challenges 
of this vast domain. Each newsgroup brings 
forward new characteristics that affect the ac- 
curacy of our predictions. The sheer volume 
of Usenet news has forced us to have an ef- 
ficient implementation. The numbers are in, 
and GroupLens provides value to participants. 
Anecdotal evidence supports this conclusion as 
we hear from users who long ago abandoned
the rec.humor newsgroup returning to it with
GroupLens guiding them to a handful of funny 
articles in just a few minutes each day. Still, 
our work is far from finished. 

There are many areas of future work to re- 
duce the possible costs to users of using col- 
laborative filtering. These costs come in three 
forms: (i) time spent entering a rating; (ii) 
performance costs incurred by the GroupLens 
software; (iii) the time wasted in reading arti- 
cles that are predicted to be better than they 
really are. 

One way of reducing the time spent entering 
a rating is to rely on implicit measures of in- 
terest, such as how long you spend reading an 
article, or whether you print or file the article
after reading it. Collaborative filtering studies
are needed to compare the costs and benefits 
of explicit ratings with implicit ratings. 

Performance costs for the current Grou- 
pLens system are low, but these costs increase 
with the number of users. One way to keep 
performance costs low is to develop distributed 
GLRBs that perform correlation and predic- 
tion independently on subsets of the user pop- 
ulation. Scaling GroupLens to the Internet 
will require distribution, to keep communication
and computation times low.

Time wasted reading articles with high pre- 
dictions that are actually uninteresting is diffi- 
cult to control. In practice, predictions become 
increasingly accurate as more readers rate the 
article. One way to make it easier for users
to use collaborative filtering to select quality 
articles would be to develop prediction algo- 
rithms that include a confidence measure for 
the accuracy of the prediction. Confidence 
measures would help users make the tradeoff 
between opportunity cost and expected value. 

GroupLens is an open architecture with 
freely distributable protocols. Anyone who 
wants to participate can write a news client to 
connect with our GLRB, or a new GLRB that 
offers improved service to our news clients. An 
open architecture encourages interaction and 
innovation by the community. For instance, 
one GroupLens participant has already written 
a proxy GLRB that downloads his ratings to
Poland overnight so he gets better interactive 
performance! 

The client library further encourages partic- 
ipation, by simplifying the task of integrating 
GroupLens with new news clients. For most 
news clients, only 2-3% of the code must be 
modified to support GroupLens. Recently, sev- 
eral of the maintainers of popular news clients 
have announced support for GroupLens. We 
look forward to working with the community 
to add GroupLens support to additional news- 
readers. We also encourage the development 
of new filter-bots that use the client library to 
communicate computed ratings to GroupLens. 

Usenet is on the one hand a rich and 
valuable information resource, and on the 
other hand a quagmire of the useless and the
tasteless. GroupLens lets us team up to drain 
the quagmire and separate the valuable from 
the useless. If we work together we can each 
peruse a fraction of the articles submitted each 
day, in exchange for having the interesting 
articles pointed out to us. More participants 
means more ratings available to Group-
Lens, which means even better predictions.
The GroupLens experience is on-going at 


http://www.cs.umr.edu/Research/GroupLens.


Join us in making Usenet useful again! 
Abstract

The recent interest in multimedia conferencing is a re- 
sult of the incorporation of cheap audio and video hard- 
ware in today’s workstations, and also as a result of the 
development of a global infrastructure capable of sup- 
porting multimedia traffic - the Mbone. Audio quality 
is impaired by packet loss and variable delay in the net- 
work, and by lack of support for real-time applications 
in today’s general purpose workstations. A considerable 
amount of research effort has focused on solving the net- 
work side of the problem by providing packet loss ro- 
bustness techniques, and network conscious adaptive ap- 
plications. Effort to solve the operating system induced 
problems has concentrated on kernel modifications. This 
paper presents an architecture for a real-time audio me- 
dia agent that copes with the problems presented by the 
UNIX operating system at the application level. The 
mechanism produces a continuous audio signal, despite 
the variable allocation of processing time a real-time ap- 
plication is given under UNIX. Continuity of audio is en- 
sured during scheduling hiccups by using the buffering 
capabilities of workstation audio device drivers. Our
solution also tries to restrict the amount of audio stored in
the device buffers to a minimum, to reduce the perceived 
end-to-end delay of the audio signal. A comparison be- 
tween the method presented here (adaptive cushion algo- 
rithm), and that used by all other audio tools shows sub- 
stantial reductions in both the average end-to-end delay, 
and the audio sample loss caused by the operating sys- 
tem. 


1 Introduction 


The ability of current wide area networks, such as the 
Mbone, to support multimedia conferencing has been re- 
cently demonstrated by a series of multicast events. One 
of the first such events was the Internet Engineering Task 
Force meeting in March 1992 [1]. The renewed interest
in multimedia conferencing is a result of the provision
of audio and video hardware in UNIX workstations and
the development of multicast for the Mbone [2]. Of the 
media used in multimedia conferences, audio is the most 
important [3], as speech is the natural mechanism of hu- 
man communication, supplemented by video and shared 
text. Audio is also the only real-time service (video is 
normally slow-scan), which makes its provision at a rea- 
sonable quality of service (QoS) far more difficult. 

The Internet’s service model offers ‘best effort’ transmission,
and is unable to provide the QoS guarantees
needed for real-time traffic. As a result, audio qual- 
ity is impaired by packet losses and variable transmis- 
sion delays over the network, but it also suffers from the 
lack of support for real-time applications in general pur- 
pose UNIX operating systems. One approach to solv- 
ing network problems has been to provide QoS guaran- 
tees through resource reservation [4]. Another approach 
has been to provide network conscious applications that 
adapt to the problems presented by the Mbone [5]. The 
latter mechanism has the advantage of being deployable 
without any network modifications. 

Traditional time-sharing operating systems for general 
purpose workstations do not provide adequate support 
for real-time applications [6]; they operate in an asynchronous
fashion, and handle data in blocks [7]. Real-time
audio applications are ‘soft’, in that they have to
keep a continuously draining audio device driver fed with
blocks of audio samples, but there is no specified instant
when audio must be transferred, just a deadline. However,
the lack of a delay bound for timers and external
events makes it impossible to guarantee a regular supply 
of audio samples. Current real-time audio applications 
ignore the problem, which results in frequently disrupted 
audio, and increased end-to-end delay. 

A solution analogous to that of resource reservations to 
provide QoS guarantees is to modify the operating system 
scheduler to provide bounded dispatch latency for appli- 
cations [7, 8, 9, 10]. Despite experimental proof that this 
approach can significantly improve the perceptual quality 
of multimedia applications, it has not been implemented
in the majority of desktop workstation operating systems. 

There is a very large number of general purpose work- 
stations connected to the Internet, that will soon want to 
use audio applications with the current scheduler. We 
present a solution at the application level, that smoothes 
out scheduling jitter in audio applications using the work- 
station’s device driver buffering capabilities. The solu- 
tion is analogous to playout adaptation for solving the 
network jitter, in that it does not require any modifications 
to workstation hardware or operating system. 

This work is part of ongoing research at UCL to de- 
velop a robust and flexible audio conferencing compo- 
nent for use on general purpose workstations over the 
Mbone. Initially begun as part of the MICE project
[11], and later continued as part of the ReLaTe [12] and
MERCI (Multimedia European Research Conferencing
Integration, European Union Telematics Applications
Programme #1007) projects, the proposed solution has been
implemented and evaluated in our Robust-Audio Tool
(RAT) [5], which is now funded by an EPSRC project
[13]. RAT provides real-time audio connectivity between
participants in multicast multimedia conferences over the
Mbone. RAT is currently supported under SunOS, So- 
laris, IRIX, HP-UX, FreeBSD and Win32 and ports are 
under way for Linux PCs and Digital Unix platforms. 

In this paper we analyse the audio scheduling problem 
and its implications. The solution and implementation 
in RAT is described, together with evaluation results that 
show the success of our method. 


2 Background 


The audio device can be visualised as two buffers: 


• The input buffer continually inputs samples from the
analogue audio input (microphone).

• The output buffer is fed with samples from the application
and these are synchronously output to the
loudspeakers.


The rate of input sample accumulation, and sample out- 
put is the same (sampling rate). The operating system 
buffer in the device driver has a fixed upper limit, and is 
adjustable under most OS platforms. 

The audio application interacts with audio input and 
output through the two device driver buffers. Samples 
are read from the input buffer for processing and written 
out to the output buffer for playback. In contrast with actual
audio playback, the transfer of blocks of samples between
the audio application and the device driver is ad hoc.
Blocks are read from and written to the device driver
buffers, and all audio processing takes place on this block 
unit size. The minimum block size is enforced by the 
workstation’s processing power; the larger the block size,
the less frequently the application has to read and write 
audio blocks. 


Figure 1 illustrates the basic functionality of our audio
tool. The operation is similar to that of other existing conferencing
audio tools like VAT [14], NeVoT [15] and IVS
[16]. 

For the purposes of this paper the audio tool can be 
considered as a process that acts as an interface between 
the audio device and the network. The process provides a 
two way communication facility, and samples input from 
the audio device are processed and transmitted across 
the network to remote conference participants. Packets 
are received from the network, processed, and samples 
played out to the audio device. For a description of the 
functions involved in processing the audio samples and 
for more information on the structure of our audio tool 
see [5]. 


[Diagram: device driver and audio application]
Figure 1: RAT operation


Mbone audio tools use silence detection and suppres- 
sion to stop the transmission of audio when a conference 
participant is not active. The current algorithms oper- 
ate by measuring the average energy in sample blocks 
[17]. A decision about speech/silence is made by com- 
paring the average energy in a block against a thresh- 
old. Blocks that are judged to be above the threshold 
are labelled as ‘speech’, and transmitted. Blocks that are
judged to be below the threshold are labelled as ‘silence’
and thrown away. This operation produces talkspurts (periods
of continuous speech packets) interspersed with periods
where no packets are transmitted. The threshold between
speech and silence is adjusted during periods of silence.
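
The energy-threshold decision described above can be sketched in C as follows. This is a minimal illustration, not the algorithm of [17]: it assumes 16-bit linear samples, takes "average energy" to mean the mean of the squared amplitudes, and uses a fixed threshold rather than one adapted during silence. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mean energy of one block: the average of the squared sample amplitudes.
 * (Assumed definition; the text only says "average energy".) */
double block_energy(const int16_t *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (double)samples[i] * (double)samples[i];
    return sum / (double)n;
}

/* Returns 1 if the block is labelled speech (to be transmitted),
 * 0 if it is labelled silence (to be thrown away). */
int is_speech(const int16_t *samples, size_t n, double threshold)
{
    return block_energy(samples, n) > threshold;
}
```

A real tool would also update the threshold while the input is judged silent, as the text notes; that adaptation is left out of the sketch.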

The effect of sample loss on audio may produce sig- 
nificant degradation in the audio quality, depending on 
the length of individual gaps, and the frequency of occur- 
rence. Audio gaps indicate a pause in speech to the hu- 
man brain causing confusion and reducing intelligibility. 
This problem has been studied by the authors in [3, 5]. 
The minimisation of sample loss is a primary design goal 
in audio applications. 
In real-time conferencing audio systems, interactivity is
known to be substantially reduced for round trip delays
larger than 600 ms. Large round trip delays in conversational
situations increase the frequency of confusions and the
amount of both double talking and mutual silence [18].
The components of this delay in a real-time Internet au- 
dio tool are: 


• Processing and packetisation delay at transmitter.

• Variable delay experienced by audio packets
traversing the network.

• Audio reconstruction buffering at the receiver necessary
for smoothing out network delay jitter.

• Delay caused by queued audio samples in the operating
system device buffers.


The minimisation of delay is also a primary design goal 
in audio applications. 


3 Audio playback problem 


There are two main goals when considering operating 
system effects for achieving good quality real-time audio 
communication: 


• Continuous audio playback with no unnecessary
breakups.

• Low delay in device driver buffers to minimise the
end-to-end delay, so that interactivity and normal
communication patterns are preserved.


To play-back audio, buffers of samples have to be 
transferred to the audio device driver. If all the samples 
in the device driver are played out before more are sup- 
plied, then the buffers run dry, and audio playback stops. 
To avoid the resulting gap of silence in the output audio, 
the transfer of samples must take place regularly. 

Modern workstation audio device drivers provide a 
selection of audio sampling frequencies. Conferencing 
applications usually aim to transmit telephone quality 
speech, and consequently would like to use a sampling 
frequency of 8 kHz. However, audio sampling frequency
crystals usually do not have a nominal frequency of exactly
8 kHz, and can vary considerably from one workstation
to the next [19]. Timing events using the workstation
clock without compensating for the drift in the audio
crystal will lead to one conference participant’s audio
buffers becoming full, while a remote workstation’s
buffers may run dry. To simplify operations the sam- 
pling clock of the audio device is used instead. This is 
achieved by using the number of samples read from the 
audio device as an indication of the amount of time that 
has elapsed. Since one crystal is used for audio input and 
playback, the number of samples read gives a very accurate
count of the number of samples played out. With the
accurate knowledge of the number of samples that have 
been consumed by the audio device, a receiver can cal- 
culate the drift from a transmitter and compensate by ad- 
justing the duration of silence periods. 


4 Implications for a Real-Time Audio Tool 


Networked audio tools have a strict loop of operation in 
order to meet real-time timing restrictions. In each cycle 
of this loop the following basic operations take place: 


• A block of samples is read in from the device driver
input buffer.

• An equivalent number of received and processed
samples are written out to the audio device.

• Other processing takes place which may involve
packets being transmitted onto or received from the
network.


Using this order of operations attempts to ensure that 
as many samples as needed are fed to the output buffers 
of the audio device. Feeding more would create a back- 
log of samples in the device driver to play out and would 
increase the delay. Providing less would result in gaps of 
silence in playback because of the buffers running dry. 

On a non-multi-tasking machine, keeping up with this 
loop is not hard to achieve. Under a time-sharing operat- 
ing system - like UNIX - this may not always be possible. 
The UNIX scheduler decides when a process gets control 
of the CPU. Processes are serviced according to their priority,
and there is no useful upper bound on the amount
of time a process may be deprived of execution [7, 8].

As a result of other processes being serviced, a rela- 
tively large amount of time may elapse in between two 
instances of the audio tool being scheduled. If the time 
between schedules is larger than the length of audio data 
that was written last time, then an audible gap of silence 
will result in the output audio signal. The persistent effect 
of the gap in the audio output is to restrict the timing of the 
output, with respect to the input; extra delay is accumu- 
lated in the device driver, since each block of samples is 
played out later than it should have been. Measurements 
have shown that the accumulated interruptions caused by 
an intensive external event on a loaded workstation, can 
create a delay of several hundreds of milliseconds over 
the period of a cycle of operation of our program. Since 
the audio system is timed from the read operations of the 
audio device, the time that elapses when the audio tool 
process does not have control of the CPU will be evident; 
there is lots of audio in the input buffer of the audio de- 
vice waiting to be read. In response to the waiting input 
audio, the process will execute the loop several times in
succession, until it reads the blocks in. For each of the
blocks read, a block of equivalent size will be written out
to the output buffers. Figure 2 shows the delay increase
diagrammatically.


[Figure: input buffer level and device buffer level against time, with the ‘buffer dry’ interval marked]


Figure 2: Accumulation of delay in output buffer


The extra buffered audio samples in the device driver 
prevent additional gaps, as there is now extra audio to 
play out while the audio tool process is not executing. 
Current Internet conferencing audio tools like VAT [14], 
NeVoT [15] and IVS [16] rely on this effect. During a 
talkspurt, the first CPU starvation causes a gap in the au- 
dio. Successive starvations will only cause a gap if they 
are longer in duration than the longest starvation so far. 
Consequently, the delay caused by the buffered samples 
in the audio device increases up to the length of the maxi- 
mum interruption. At the end of the talkspurt the accumu- 
lated delay is zeroed, as transfer of samples to the audio 
device stops, and the audio device output buffer drains. 
If silence suppression is not enabled in the transmitting 
tool, then the increase in delay will persist throughout the 
duration of the session. 


5 Adaptive buffer solution 


In order to eliminate gaps in the audio signal, and min- 
imise delay in the device driver buffer, adaptive control 
of the buffers is sought. The adaption algorithm will trade
off the two variables. The adaption algorithm controls the
amount of audio in the buffers, and ensures that there is
always a minimum amount of audio to reduce the gaps
caused by scheduling anomalies. Monitoring a history of
the size of scheduling anomalies means that excessive au- 
dio is not buffered. 

Intuitively, a situation where the amount of samples in 
the device driver buffer (called the cushion) can cover
for small frequent interruptions is ideal. This principle 
will not introduce excessive delay, but large infrequent
interruptions will still cause a gap in the audio signal. The
size of the cushion must be determined by the current per- 
formance of the workstation, which can be estimated by 
analysing a history of scheduling anomalies. As new processes
start, or external events happen, the overall perceived
load and behaviour of the scheduler will change.
In order to maintain the desired sound quality and min- 
imise the delay, the cushion size must be adaptive, and 
should attempt to reflect the state of the workstation. 


[Diagram: device driver and audio application]
Figure 3: Adaptive cushion algorithm


5.1 Measuring workstation state


An estimate of the state of the workstation can be
achieved by a simple modification to the structure of the
audio application.

In the audio tool application loop, instead of reading a
block of fixed size from the audio device driver, a non-blocking
read is made, and all the stored audio is retrieved.
The amount of audio gives an exact measure of
the amount of time that has elapsed since the last time the
call was made.


[Plot axis: Elapsed time (ms)]
Figure 4: Read length variation


A history of the time between successive calls reflects
the loading of the workstation. Figure 4 shows measurements
from a lightly-loaded SUN Sparc 10 workstation
over a period of two minutes. There are readings as often
as every 10 ms, which gives enough load samples to
be able to monitor load changes as they happen.
There is an exception event to cater for in the implementation:
when the amount of audio that is returned
by the read call is equal to the total size of the device
driver buffer, it is likely that an overflow has occurred
and some input samples have been lost, and with them
the track of time. The audio tool process then resynchronises
using external mechanisms like the workstation clock.
This situation does not happen very often, since the to- 
tal length of the device driver buffer is configurable, and 
is usually set large enough to cater for a few seconds of 
continuous audio input. 
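
The measurement step of this section can be sketched with standard POSIX calls. The sketch assumes the audio device is an ordinary file descriptor and that the capacity passed in equals the driver's input buffer size; the function names are hypothetical, and the result is in bytes, so dividing by the sample size gives the elapsed time in samples.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Put a descriptor into non-blocking mode; returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Read whatever is queued on fd into buf (capacity cap bytes).
 * Returns the number of bytes read (0 if nothing was waiting) or -1 on error.
 * *overflow is set to 1 when the read filled the whole capacity, which the
 * text treats as a likely overflow that requires resynchronisation. */
ssize_t read_all_pending(int fd, void *buf, size_t cap, int *overflow)
{
    assert(buf != NULL && overflow != NULL && cap > 0);
    ssize_t n = read(fd, buf, cap);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        n = 0;   /* nothing queued yet: no time has elapsed */
    *overflow = (n >= 0 && (size_t)n == cap);
    return n;
}
```

When the overflow flag is raised, the caller would fall back to an external clock for that cycle, as described above.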


5.2 Cushion size estimation 


Based on the workstation load information, the target fill 
level for the device driver buffer can be estimated. 

Figure 5 shows the distribution of the measurements 
presented in figure 4. The X axis represents the elapsed 
time in milliseconds. The Y axis is logarithmic and shows the
number of times each different value occurred. It can be
seen that the vast majority of measurements are smaller
than or equal to 40 milliseconds (320 samples). However, the
largest measurements are 130 milliseconds long.
Figure 5: Load measurements distribution


The estimation algorithm should maintain a cushion 
level that will cater for the majority of scheduling anoma- 
lies, without resulting in gaps in the playback. This might 
be trivially achieved by maintaining output buffer levels
at the maximum measured read length, but this will re- 
sult in excessive delay being introduced. The adaptation 
algorithm introduced here allows the user to specify how 
much audio breakup should be traded for a reduction in 
delay. 

The algorithm operates by maintaining a history of past 
load measurements. After a new measurement, the de- 
sired cushion size is estimated based on the recent history. 
The load measurements are stored in a circular buffer, 
and to avoid the processing overhead of examining all the 
logged data, a histogram of load measurements is main- 
tained. The histogram is incrementally built, and each 
time a new load measurement is made, the oldest one in 
the circular buffer is removed from both the buffer and 
the histogram, and replaced with the new one. An estimate
is made by examining the histogram, and calculating
a cushion value that will cover a given percentage of
measurements. 

Let $b_i$ be the number of load samples of length $i$ and $h$
the length of the history. Then we have:

$$\sum_{i=1}^{\infty} b_i = h$$

If $t$ is the threshold of read lengths which is being catered
for, then the cushion $c$ is given by:

$$c = \min \left\{ x : \sum_{i=1}^{x} b_i \geq t \right\}$$

The cushion adaptation algorithm is configurable by
adjusting the history length $h$ and by varying the threshold
$t$ of load measurements that are going to be catered
for.

The desirable behaviour would be to follow the trend 
in scheduling load, and avoid rapid jumps in cushion size 
due to very short-lived bursts. This can be achieved by increasing
the history length, which in effect low-pass filters
the load information. The increase in the history length, 
however, is at the expense of fast adaption since the cush- 
ion estimate now depends on older data. 
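
The estimator of this section (circular buffer plus incrementally maintained histogram) might be sketched in C as below. This is an illustration under assumptions, not RAT's code: read lengths are taken to be small integers (for example milliseconds), the limits HISTORY_MAX and READ_MAX are arbitrary, "covering" t measurements is read as "at least t", and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HISTORY_MAX 2000   /* largest supported history length h (assumed) */
#define READ_MAX    1024   /* read lengths are clamped to this value (assumed) */

typedef struct {
    unsigned short ring[HISTORY_MAX];   /* circular buffer of recent read lengths */
    unsigned int   hist[READ_MAX + 1];  /* hist[i]: logged reads of length i */
    size_t h;       /* history length */
    size_t count;   /* measurements currently logged, at most h */
    size_t next;    /* ring slot that will be overwritten next */
} cushion_est;

void cushion_init(cushion_est *e, size_t h)
{
    assert(e != NULL && h > 0 && h <= HISTORY_MAX);
    memset(e, 0, sizeof *e);
    e->h = h;
}

/* Log a new read length, evicting the oldest one once the history is full. */
void cushion_add(cushion_est *e, unsigned int read_len)
{
    if (read_len > READ_MAX)
        read_len = READ_MAX;
    if (e->count == e->h)
        e->hist[e->ring[e->next]]--;   /* drop the oldest from the histogram */
    else
        e->count++;
    e->ring[e->next] = (unsigned short)read_len;
    e->hist[read_len]++;
    e->next = (e->next + 1) % e->h;
}

/* Smallest logged length x such that at least t logged reads are <= x.
 * If fewer than t reads are logged, the largest logged length is returned. */
unsigned int cushion_estimate(const cushion_est *e, size_t t)
{
    size_t covered = 0;
    unsigned int largest = 0;
    for (unsigned int x = 0; x <= READ_MAX; x++) {
        if (e->hist[x] == 0)
            continue;
        covered += e->hist[x];
        largest = x;
        if (covered >= t)
            return x;
    }
    return largest;
}
```

Each update touches one ring slot and two histogram bins, and an estimate scans at most READ_MAX + 1 bins, so the cost per cycle does not grow with the history length.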
Figure 7: Stable cushion


Figure 6 shows the estimated cushion for the data pre- 
sented in figure 4. A relatively small averaging period 
was used (200 measurements), and therefore the cushion 
varies continuously in an attempt to match the changing 
workstation performance. Figure 7 shows the estimated 
cushion using a larger averaging period (2000 measure- 
ments). The estimate is a lot more stable and represents 
the trend in workstation load. By increasing the averag- 
ing period, a reduction in average delay and a small in- 
crease in audio gap has resulted. 
It is important to note that the algorithm design aims to
smooth out short-lived scheduling anomalies, and cannot
provide a solution if the workstation cannot (on average)
keep up with the requirements of the real-time process.
This can be detected if there is a constant trend in cushion
increase. In this case, the audio tool should reduce the
amount of processing performed (load adaption). This
work is planned as part of the EPSRC RAT project [13].


5.3 Buffer adjustment 


Maintaining a given amount of audio in the device driver 
buffer can easily be accomplished. The output buffer 
is initialised with a given number of samples (cushion 
size). On each cycle of the program, we calculate how 
many samples remain in the device driver buffer using the 
elapsed time given by the read length: 


samples left in cushion = cushion size — read length 


The cushion is refilled by writing out the required number 
of samples. An outline of the code that achieves this is 
given in figure 8. 


/* Initialise cushion */
write(fd, mixed_audio_buffer, cushion_size);

for (;;)
{
    /* Read all there is (non blocking) */
    elapsed_time = read(fd, input_buffer, MAX_SIZE);

    /* Refill cushion */
    write(fd, mixed_audio_buffer, elapsed_time);

    /* Do all other processing */
}


Figure 8: Maintaining the cushion size 


When the elapsed time exceeds the cushion because of 
a scheduling anomaly, then we shouldn’t write back as 
much as we read, because this will effectively increase 
the cushion size. The cushion should be re-initialised by 
writing out its full size in this case. To maintain syn- 
chronisation between the input and playback operations 
the extra samples that were not written out are discarded. 
As these samples correspond to the silence period that 
resulted from the cushion overrun, no additional distur- 
bance results from us discarding them and the timing re- 
lationship in the played out signal is maintained. 
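
The overrun rule in the paragraph above can be written as a small planning function. This is a sketch with hypothetical names, assuming the cushion size and the read length are both expressed in samples.

```c
#include <assert.h>

typedef struct {
    unsigned long to_write;     /* samples to write to the output buffer */
    unsigned long to_discard;   /* playout samples dropped to stay in sync */
} refill_plan;

/* cushion: target fill level; elapsed: last read length (both in samples). */
refill_plan plan_refill(unsigned long cushion, unsigned long elapsed)
{
    refill_plan p;
    assert(cushion > 0);
    if (elapsed <= cushion) {
        /* Normal cycle: replace exactly what the device played out. */
        p.to_write = elapsed;
        p.to_discard = 0;
    } else {
        /* Cushion overrun: the buffer ran dry, so re-initialise it to its
         * full size and drop the samples that fell into the silent gap. */
        p.to_write = cushion;
        p.to_discard = elapsed - cushion;
    }
    return p;
}
```

For a 320-sample cushion, a 160-sample read leads to 160 samples being written, while a 1040-sample read leads to 320 samples being written and 720 being discarded.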


5.3.1 Varying the cushion size 


The methods described here vary the cushion size in line 
with the desired value. The cushion size reflects the cur- 
rent state of the workstation, which may change in re- 
sponse to other processing on the workstation. 
Changing the cushion size during silence periods has
minimal effect on the perception of audio [20]. It is
thus preferable to make any adjustments necessary during
such periods.

However, it is possible to alter the cushion size during 
a talkspurt. A decrease in size will result in a reduction of 
the end-to-end delay that will be perceived at the end of 
the talkspurt. This can be achieved by simply writing out 
fewer samples than those needed to refill the current cushion.
At this stage we have to decide what to do with the
remaining samples that we did not transfer to the audio 
device. The simplest solution is to discard them. If the 
change in cushion size is small enough, as is usually the 
case, then the missing samples will not be noticed [20]. 

Increasing the cushion size is not as easy as decreas- 
ing it, since extra samples are needed. However, there 
may be extra audio available if the decision to increase 
the cushion follows a scheduling anomaly (see section 
5.3). If there isn’t extra audio available to fill the gap, 
then it has to be artificially created. Techniques for ar- 
tificially creating extra audio required during periods of 
sample loss have been extensively studied by the authors
[3, 5].
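
The arithmetic of resizing the cushion during a talkspurt can be sketched as follows. This is a hypothetical helper, not part of RAT: it assumes a normal cycle (the read length does not exceed the old cushion), works in samples, and only reports how many samples would have to be written, discarded or synthesised; how the extra audio is created is outside its scope.

```c
#include <assert.h>

typedef struct {
    unsigned long to_write;       /* real samples written this cycle */
    unsigned long to_discard;     /* real samples dropped to shrink the cushion */
    unsigned long to_synthesise;  /* extra samples needed to grow the cushion */
} resize_plan;

resize_plan plan_resize(unsigned long old_cushion, unsigned long new_cushion,
                        unsigned long elapsed)
{
    resize_plan p = { elapsed, 0, 0 };
    assert(elapsed <= old_cushion);   /* overruns are handled separately */
    if (new_cushion < old_cushion) {
        unsigned long shrink = old_cushion - new_cushion;
        if (shrink > elapsed)
            shrink = elapsed;         /* cannot drop more than is available */
        p.to_write = elapsed - shrink;
        p.to_discard = shrink;
    } else {
        p.to_synthesise = new_cushion - old_cushion;
    }
    return p;
}
```

After the cycle the buffer holds old_cushion - elapsed + to_write + to_synthesise samples, which equals the new cushion whenever the requested shrink does not exceed the read length.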


6 Discussion of results 


The adaptive cushion algorithm has been implemented 
and tested in RAT. It has produced a perceivable reduc- 
tion in the end-to-end delay of the system. 

A simulator was built to evaluate the performance of 
the adaptive cushion algorithm. The simulator inputs 
load measurements and talkspurt information, and uses 
these to calculate the resulting delay and audio gaps for 
different adaptation strategies. In our experiments, we 
used real load measurements collected from RAT, and 
modelled talkspurt and pause durations after statistical 
data given in [21]. The data presented below uses load 
information collected on a medium loaded Sun Sparc 10 
workstation running Solaris 2.4 over a period of 20 min- 
utes. Results were collected during a multicast multime- 
dia conference, while transmitting and receiving audio 
with RAT and moving video with vic [22]. 


Adaptation | Read % | Avg del | Delo
No cushion
195/200
970/1000
1800/2000
1970/2000





Table 1: Audio delay results (in ms) 


Table 1 shows resulting end of talkspurt delay values 
for different adaptation conditions. The first row repre- 
sents operation of the audio tool without the use of the 
cushion mechanism - as is used in all other existing audio 
tools. The remaining rows represent results from the use
of different parameters in the adaptive cushion algorithm.
The two values in the column describing the adaptation
parameters show the number of load measurements that
are required to be covered by the cushion and the length
of the history that is kept. The column labeled “Read
%” gives the ratio of these two values, indicating the percentage
of measurements that are expected to be covered
by the cushion. For each adaptation method the average, 
standard deviation and maximum of the delay are given 
in milliseconds. 

Table 2 shows resulting audio gaps in talkspurts for the 
same adaptation conditions. 


No cushion 34 43 
195/200 

970/1000 

1800/2000 

1970/2000 


Total gap 





Table 2: Audio gap results (in ms) 


It can be seen that the adaptive cushion algorithm re- 
sults in improvement to both gap size and delay compared 
to the standard mechanism used in other audio tools. 


7 Conclusion 


The recent interest in multimedia conferencing is a result 
of the incorporation of cheap audio and video hardware 
in today’s workstations, and also as a result of the devel- 
opment of a global infrastructure capable of supporting 
multimedia traffic - the Mbone. Audio quality is impaired 
by packet loss and variable delay in the network, and by 
lack of support for real-time applications in today’s gen- 
eral purpose workstations. 

This paper has presented an adaptive cushion algo- 
rithm that copes with the problems presented to real-time 
audio conferencing applications by a general purpose op- 
erating system. The continuity of the audio signal is en- 
sured during scheduling anomalies by using the buffer- 
ing capabilities of the audio device driver. The algorithm 
also restricts the amount of audio in the output audio de- 
vice buffer to minimise the end-to-end delay. Negligible 
overhead in processing power is incurred with the adaptive
cushion algorithm since a history of workstation load
is built up incrementally over time. 

The results presented in this paper show that there is 
a significant improvement in both minimisation of delay 
and audio gap size to be obtained from using the adaptive 
cushion algorithm. 


8 Further work 


Work is continuing in this area to tune the adaptive cush- 
ion algorithm. In particular we hope to identify a map- 
ping between different user preferences and suitable his- 
tory lengths for multi-way interactive multimedia confer- 
ences. 
Abstract

Existing user-level thread packages employ a “black 
box” design approach, where the implementation of 
the threads is hidden from the user. While this ap- 
proach is often sufficient for application-level program- 
mers, it hides critical design decisions that system- 
level programmers must be able to change in order 
to provide efficient service for high-level systems. By 
applying the principles of Open Implementation Anal- 
ysis and Design, we construct a new user-level threads 
package that supports common thread abstractions 
and a well-defined meta-interface for altering the be- 
havior of these abstractions. As a result, system-level 
programmers will have the advantages of using high- 
level thread abstractions without having to sacrifice 
performance, flexibility, or portability. 


1 Introduction 


Lightweight threads are useful for a variety of pur- 
poses. An application-level programmer will typically 
use threads to facilitate asynchronous scheduling for 
a number of related tasks. For example, consider 
an event-driven application, such as Xlib [25], where 
lightweight threads are used to schedule tasks for ex- 
ecution based on an external event. In this context, 
threads free the programmer from the details of dy- 
namic scheduling. Fine-grain control over the behavior 
of a thread is typically not needed. There is no short- 
age of lightweight thread packages for application-level 
programmers, and a short list of such systems would 
likely include pthreads [22] (the POSIX interface for 
lightweight threads [16]), Solaris threads [26], fast- 
threads [2], and cthreads [23]. 

Lightweight threads are also useful for supporting 
independent tasks generated by parallel or concurrent
programming languages. In this context, system-
level programmers use lightweight threads as a major 
building-block of a multithreaded runtime system. For 
example, each of the following languages is supported 
by a multithreaded runtime system: CC++ [12], For- 
tran M [12], Opus [15], Orca [7], PC++ [8], Sisal [13], 
Split-C [11], and SR [3]. However, none of these mul- 
tithreaded runtime systems employs a single thread 
package for all platform implementations, and few use 
any of the lightweight thread packages listed in the 
preceding paragraph. While this may be surprising, 
there are several reasons why multithreaded runtime 
system designers shy away from standard lightweight 
thread packages, including: 


1. Lack of flexibility. Existing thread packages are 
implemented as black boxes, so it is almost al- 
ways impossible to change the detailed behavior 
of threads, mutexes, run lists, etc. However, most 
multithreaded runtime systems require explicit 
control over scheduling decisions and the interac- 
tion of threads with a communication substrate. 
For example, the Panda runtime system [7], which 
supports the Orca programming language [5], re- 
quires preemptive scheduling of threads with pri- 
orities, and the ability to turn preemption off and 
explicitly poll for incoming messages when there 
are no active threads. On the other hand, the run- 
time system that supports the SR programming 
language [3] assumes non-preemptive scheduling, 
in which the scheduler is free to select the next 
thread to be executed from a list of runnable 
threads. Both languages support communication 
between multiple processors, and this communi- 
cation directly affects thread scheduling. 


2. Lack of performance. Existing thread packages 
are geared towards supporting application-level 
programmers, who typically require threads to 
behave as normal Unix processes would. However,
supporting this behavior, including proper
signal handling, adds a good deal of overhead 
to the thread operations. In contrast, most 
multithreaded runtime systems want bare-bones 
threads that are very fast, and to which they will 
add the complexities they require. The key here 
is that the runtime system designer, rather than 
the thread package designer, is in control of the 
tradeoffs between functionality and performance. 


3. Lack of portability. Multithreaded runtime sys- 
tems must execute on a wide variety of hard- 
ware and operating system platforms, which is ex- 
tremely problematic for most lightweight thread 
packages. 


4. Lack of information. Multithreaded runtime sys- 
tem designers often require tracing information 
for debugging and statistics information for tun- 
ing. Existing thread packages provide little or 
no support for obtaining this sort of information 
about the execution of the threads. 


The central problem with using existing thread 
packages to support multithreaded runtime systems 
is that most existing thread packages are designed as 
“black-boxes,” providing only a very limited number of 
ways in which behavior can be modified. This is typ- 
ically limited to altering the default size for a thread 
stack and choosing between a small, fixed number of 
pre-implemented thread scheduling policies. This is 
almost always too restrictive for system-level program- 
mers, and so most end up “rolling their own” thread 
packages. 

Though “black box” abstractions are effective for 
constructing complex systems because they hide the 
details of an implementation, they don’t always work 
because “there are times when the implementation 
strategy for a module cannot be determined before 
knowing how the module will be used in a partic- 
ular system” [18]. As we’ve seen, this is particu- 
larly true for multithreaded runtime systems employ- 
ing lightweight threads, where many of the implemen- 
tation decisions are to be made by the runtime system 
designer, not the thread package designer. Recently, 
the object-oriented research community has been ad- 
dressing the issue of improved design methodologies 
for substrate (system-level) software [9, 18, 19, 20, 27]. 
From this research, a new design methodology for sub- 
strate software has emerged, called Open Implementa- 
tion [18, 19]. The basic idea behind this new design 
methodology is to open the proverbial black box using 
a well-defined interface, called a meta-interface, that 
describes how the abstractions provided in the user- 
interface are to behave. 

In this paper we provide an Open Implementation 
analysis and design of lightweight, user-level threads. 
As a proof-of-concept, we have implemented this de- 
sign to create a thread library called OpenThreads. 
The goal of this research is to produce a lightweight 
threads library that provides both a simple user-level 
interface and a robust meta-level interface for alter- 
ing the behavior of the abstractions, so that a single 
threads package can be efficient, flexible, and portable. 
In addition, we extend the Open Implementation de- 
sign methodology to address portability concerns by 
defining a system-level interface that clearly defines 
underlying dependencies. 

The remainder of this paper is divided into five sec- 
tions. Section 2 provides background on threads and 
Open Implementation analysis and design. Section 3 
provides the Open Implementation analysis and design 
for OpenThreads, and discusses the three interfaces.
Section 4 provides a discussion of our design 
relating to performance and open issues. Section 5 
outlines related research projects, and we conclude in 
Section 6. 


2 Background 


In this section we provide background information for 
readers unfamiliar with either lightweight, user-level 
threads or Open Implementation Analysis and Design. 


2.1 User-level Threads 


User-level threads provide the ability for a program- 
mer to create and control multiple, independent units 
of execution entirely outside of the operating system 
kernel (i.e., in user-space). The state of these threads is 
often minimal, consisting usually of an execution stack 
allocated in heap space and the set of CPU registers, 
and so these threads are often termed lightweight. 
Since the OS kernel controls addressing and schedul- 
ing for the CPU, user-level threads must be multi- 
plexed atop one or more kernel-level entities, such as 
Unix processes [4], Mach kernel threads [1], or Sun 
Lightweight Processes (LWP) [24]. This “multiplex- 
ing” is commonly referred to as scheduling of the user- 
level threads. The kernel-entity (hereafter referred to 
as a “process”) also provides a common address space 
that is shared by all threads multiplexed onto that 
process, and synchronization primitives are provided 
to keep the memory consistent. It is also possible for 
threads to have some amount of thread-specific data by 
storing pointers to this data on each thread stack. 
Scheduling policies for lightweight threads can be 
broadly classified either as non-preemptive, in which a 
thread executes until completion or until it decides to 
willingly yield the processor, or as preemptive, in which 
a thread can be interrupted at an arbitrary point during
its execution so that some other thread may execute.
Orthogonal to the issue of preemption, thread 
scheduling can incorporate a wide range of capabilities, 
including priorities, hand-off scheduling, and arbitrary 
run list structures (such as trees). Clearly, there are 
many design choices to be made for thread schedul- 
ing, and many runtime system designers will want to 
switch between various policies depending on the state 
of the system. If a user-level thread package is not useful
to a system-level programmer, lack of control over 
scheduling is commonly at the root of the cause. 


2.2 Open Implementation Analysis and Design 


In [18], Kiczales introduces a new approach for the 
design of substrate software called Open Implemen- 
tation, and in this section we summarize these ideas. 
The reader is encouraged to examine [18] and [20] for 
more details on this design philosophy. 

We have already stated that black-box abstractions 
do not always work because “there are times when 
the best implementation strategy for a module can- 
not be determined without knowing how the mod- 
ule will be used in a particular situation” [18]. So, 
what happens when a programmer using a black-box 
abstraction is confronted with a conflict between how 
the abstraction is implemented and how the abstrac- 
tion should be implemented? Since the current imple- 
mentation is hidden within the internal portion of the 
black-box (as depicted in Figure 1), the programmer 
must “code around” the problem. This results in ei- 
ther hematomas of duplication or coding between the 
lines. 

A hematoma of duplication occurs when a system- 
level programmer writes his own threads package, en- 
suring that his performance and flexibility demands 
are met. In addition to increasing the size and com- 
plexity of the resulting system, hematomas can result 
in convoluted code above the runtime system, where 
the black-box thread package may still be used. 

Coding between the lines occurs when a program- 
mer writes code in a particularly contorted way to get 
the desired performance or functionality. For exam- 
ple, consider a multithreaded runtime system designer 
who plans to create and destroy many threads. If the 
underlying thread package does not provide a way to 
cache the thread control blocks and thread stacks, the 
programmer might have to create server threads that 
never really die so that resource management overheads
are minimized. 

These examples demonstrate that black-box designs 
often hide too much of the implementation for sub- 
strate software. While some of the implementation de- 
cisions are details that can be (and should be) hidden 
without problem, others are crucial to writing efficient
software and should be exposed in a controlled manner.
These crucial design decisions are called dilemmas. 

The Open Implementation design depicted in Fig- 
ure 1 provides a mechanism for exposing dilemmas 
to the programmer so that these crucial design de- 
cisions can be made on a per-application basis. This 
mechanism is represented as a new interface, called 
the meta interface, that is presented to the applica- 
tion programmer for altering the behavior of the ab- 
stractions presented in the user interface. The meta 
interface provides a clean and controlled mechanism 
for customizing the implementation of substrate soft- 
ware and is the key to the Open Implementation design 
philosophy. Section 3.2.2 details the meta interface for 
OpenThreads. 


3 Design 


In this section we outline the design of OpenThreads. 
In Section 3.1 we itemize the issues that are faced when 
designing a lightweight thread package, and in Sec- 
tion 3.2 we explain how these issues are exposed to 
the user in terms of interfaces. 


3.1 Dilemmas 


We begin this section with an examination of the 
dilemmas that occur in the design of a thread package. 
As we stated in Section 2.2, the key difference between 
a black box design and an open implementation design 
is the level of control over these dilemmas. 


3.1.1 Thread States 


The lifetime of a thread is marked by a series of 
transformations between different states. A common 
set of thread states includes: being created, being 
placed onto an active run list, being selected (sched- 
uled) for execution, being blocked on a mutex or con- 
dition variable, and being terminated. The transitions 
that take a thread from one state to the next can dif- 
fer from one thread package to the next. For example, 
a blocked thread can either be resumed as an active 
thread or as a runnable thread. Figure 2 depicts these 
state transitions. 

Besides the dilemma of where to place a thread 
that becomes unblocked, there are dilemmas associ- 
ated with the transitions between the states. For ex- 
ample, what should be done when a thread is created 
or terminated? How about when a thread is in transi- 
tion between the active and runnable states? In exist- 
ing thread systems, these transitions are hidden from 
the user. This makes it impossible, for example, to 
trace the execution of a thread, since the user would 
need control over several transitions. 
Figure 1: “Black-box” and “Open Implementation” designs 

Figure 2: Thread states 


Our approach to these dilemmas is to define a set 
of events that occur (perhaps repeatedly) during the 
life of a thread, and provide a mechanism for specify- 
ing user-defined actions to be performed whenever a 
thread encounters one of these events. As depicted in 
Figure 2, the following events are defined for a thread: 
entry - when a thread first begins execution; exit -
when a thread is about to be terminated; save - when a
thread moves from an active state to a runnable state;
restore - when a thread moves from a runnable state to
an active state; blocked - when a thread moves from
an active state to a blocked state; and unblocked -
when a thread moves from a blocked state to either an
active or runnable state. 
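
As an illustration only, the six thread events can be read as a small state machine. The sketch below uses hypothetical names (`state_t`, `event_t`, `next_state`, `apply_event`) that are not part of OpenThreads, and it assumes that the entry event takes a newly created thread directly into the active state. The destination of an unblocked thread is left as a parameter, since that choice is itself one of the dilemmas discussed above.

```c
#include <assert.h>

/* Hypothetical names for illustration; not the OpenThreads API. */
typedef enum { ST_NEW, ST_RUNNABLE, ST_ACTIVE, ST_BLOCKED, ST_DONE, ST_INVALID } state_t;
typedef enum { EV_ENTRY, EV_EXIT, EV_SAVE, EV_RESTORE, EV_BLOCKED, EV_UNBLOCKED } event_t;

/* Return the state reached when a thread in state s encounters event e.
 * resume_active selects where an unblocked thread goes: nonzero means
 * directly to active, zero means back to runnable.
 * Assumption: entry moves a newly created thread straight to active. */
state_t next_state(state_t s, event_t e, int resume_active)
{
    switch (e) {
    case EV_ENTRY:   return s == ST_NEW      ? ST_ACTIVE   : ST_INVALID;
    case EV_EXIT:    return s == ST_ACTIVE   ? ST_DONE     : ST_INVALID;
    case EV_SAVE:    return s == ST_ACTIVE   ? ST_RUNNABLE : ST_INVALID;
    case EV_RESTORE: return s == ST_RUNNABLE ? ST_ACTIVE   : ST_INVALID;
    case EV_BLOCKED: return s == ST_ACTIVE   ? ST_BLOCKED  : ST_INVALID;
    case EV_UNBLOCKED:
        if (s != ST_BLOCKED)
            return ST_INVALID;
        return resume_active ? ST_ACTIVE : ST_RUNNABLE;
    }
    return ST_INVALID;
}

/* Apply an event, aborting on a transition this model does not allow. */
state_t apply_event(state_t s, event_t e, int resume_active)
{
    state_t n = next_state(s, e, resume_active);
    assert(n != ST_INVALID);
    return n;
}
```

A user-defined action attached to an event would run at the point where `apply_event` is called for that event.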


In addition to these thread-specific states, there 
are system states of either executing some runnable 
thread or being idle because all remaining threads are 
blocked. The idle system state is often achieved when 
all threads are waiting for some external event, such 
as a message or interrupt, to occur. The transition 
of the system from the active state to the idle state 
is also important, and so we define three more events 
to cover this situation: idle begin - when all threads 
in the system become blocked; idle spin - the dura- 
tion of being idle; and idle end - when some thread 
becomes runnable again. 
3.1.2 Thread Lists and Scheduling 


A thread list is a data structure used to hold a collec- 
tion of threads. The list may be used to hold threads 
that are ready to execute, called a run list, or may 
be used to hold threads in some other state, such as 
blocked on a mutex or condition variable. Since all 
threads must be on some thread list, state transitions 
typically involve moving a thread from one list to an- 
other. The scheduling policy for threads is therefore 
determined by the structure of the thread lists and the 
implementation of the operations that remove a thread 
from a list (get) and place a thread onto a list (put). 
For example, simple FIFO scheduling is achieved by 
using a FIFO queue for a run list, whereas priority 
scheduling might involve a tree of FIFO queues, where 
each leaf of the tree represents a queue of threads with 
the same priority. 
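
To make the role of get and put concrete, the following is a minimal sketch of a FIFO thread list and of a priority list built from FIFO queues. All names here (`thread_t`, `fifo_t`, `prio_list_t`, `NPRIO`) are hypothetical and chosen for illustration. As a simplification, the priority list uses a fixed array of queues rather than the tree mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not the OpenThreads definitions. */
typedef struct thread {
    struct thread *next;   /* link used while the thread sits on a list */
    int            id;
    int            prio;   /* 0 is the highest priority */
} thread_t;

typedef struct { thread_t *head, *tail; } fifo_t;

void fifo_init(fifo_t *q) { q->head = q->tail = NULL; }

/* put: append the thread at the tail of the queue. */
void fifo_put(fifo_t *q, thread_t *t)
{
    assert(q != NULL && t != NULL);
    t->next = NULL;
    if (q->tail != NULL)
        q->tail->next = t;
    else
        q->head = t;
    q->tail = t;
}

/* get: remove and return the thread at the head, or NULL if empty. */
thread_t *fifo_get(fifo_t *q)
{
    thread_t *t = q->head;
    if (t != NULL) {
        q->head = t->next;
        if (q->head == NULL)
            q->tail = NULL;
        t->next = NULL;
    }
    return t;
}

#define NPRIO 4   /* assumed number of priority levels */
typedef struct { fifo_t level[NPRIO]; } prio_list_t;

void prio_init(prio_list_t *p)
{
    for (int i = 0; i < NPRIO; i++)
        fifo_init(&p->level[i]);
}

/* put: file the thread under its own priority level. */
void prio_put(prio_list_t *p, thread_t *t)
{
    assert(t->prio >= 0 && t->prio < NPRIO);
    fifo_put(&p->level[t->prio], t);
}

/* get: take the oldest thread from the highest non-empty level. */
thread_t *prio_get(prio_list_t *p)
{
    for (int i = 0; i < NPRIO; i++) {
        thread_t *t = fifo_get(&p->level[i]);
        if (t != NULL)
            return t;
    }
    return NULL;
}
```

Swapping `fifo_get`/`fifo_put` for `prio_get`/`prio_put` changes the scheduling policy without touching any code that merely moves threads between lists.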


Existing thread packages hide the concept of thread 
lists and provide abstract notions of scheduling for 
the system, such as FIFO or Round-Robin. How- 
ever, this black-box approach prevents system-level 
programmers from gaining control over the most fundamental
part of a thread package. For example, what 
if a tree structure would be most efficient for a run 
list, or what if multiple run lists are desired? What 
if mutex variables are to have thread lists which are 
scheduled differently from the run list, or from condi- 
tion variable blocked lists? These dilemmas about the 
nature and behavior of thread lists need to be exposed 
to the system-level programmer. 

Our approach to these dilemmas is to make thread 
lists explicit, first-order entities in the system, and al- 
low the user to define the get and put primitives for 
each list. Thread lists (or queues) are then explic- 
itly associated with mutexes, condition variables, and 
runnable threads. OpenThreads provides mechanisms 
for specifying which list a thread is to be placed onto 
when yielding (ot_thread_yield_onto) and when ini- 
tializing a thread (as an argument to ot_thread_init). 
Note that although there are multiple thread lists, only 
one at a time may be designated as the official “run 
list”, which is where ready threads are selected for ex- 
ecution. The run list is specified as an argument to 
the ot_begin_mt function. 


3.1.3 Context Switching Modes 


Every thread maintains state information that defines 
the thread. Minimally, this set includes the stack 
pointer, the instruction pointer, and the contents of 
the live registers. During a context switch between 
two threads, the state information of the old thread is 
saved and the state information of the new thread is 
restored. 

Existing thread packages treat all context switches 
the same with respect to the state of a thread. How- 
ever, this should not always be the case. Some threads, 
for example, may only use the integer registers, so sav- 
ing and restoring the floating point registers at each 
context switch is a waste of precious cycles. But since 
the thread package designer doesn’t know if threads 
will be using the floating point registers or not, a con- 
servative decision is made and all registers are saved. 

Our approach to these dilemmas is to allow for the 
context of a thread to be defined as either involving all 
registers, only the integer registers, or no registers. A 
more flexible approach might allow for a user-defined 
function that saves the exact thread state needed, but 
such a function would almost certainly be platform- 
dependent, thus violating portability. 


3.1.4 Stack Management 


Often, the most time-consuming portion of creating a 
new thread is allocating and aligning the thread stack. 
While most thread packages offer support to change 
the size of a thread’s stack, few allow the programmer 
to specify the stack allocation policy. Lack of control 
over this dilemma is another common reason why multithreaded
runtime systems abandon common thread 
packages. For example, the programmer may want to 
cache the stacks because threads are created and de- 
stroyed rapidly, but there are never more than a small 
number of threads alive at any one time. Another example
is when the programmer wishes to enable checks 
on stack overflow, or to resize the stack at runtime to 
enable stack growth. 

Our approach to this dilemma is to allow the user 
to specify stack allocation and release policies. 
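
A stack-caching policy of the kind described above might look like the following sketch. The names (`stack_cache_t`, `cache_alloc`, `cache_free`, `STACK_CACHE_MAX`) and the function signatures are assumptions made for illustration; the text does not give the signatures that OpenThreads expects for user-supplied allocation and release functions.

```c
#include <assert.h>
#include <stdlib.h>

#define STACK_CACHE_MAX 8   /* assumed bound on the number of cached stacks */

/* Hypothetical cache of equally sized thread stacks. */
typedef struct {
    void  *slot[STACK_CACHE_MAX];
    int    count;
    size_t stack_size;
} stack_cache_t;

void cache_init(stack_cache_t *c, size_t stack_size)
{
    c->count = 0;
    c->stack_size = stack_size;
}

/* Allocation policy: reuse a cached stack if one exists, else malloc. */
void *cache_alloc(stack_cache_t *c)
{
    assert(c != NULL);
    if (c->count > 0)
        return c->slot[--c->count];
    return malloc(c->stack_size);
}

/* Release policy: keep up to STACK_CACHE_MAX stacks, free the rest. */
void cache_free(stack_cache_t *c, void *stack)
{
    assert(c != NULL);
    if (stack == NULL)
        return;
    if (c->count < STACK_CACHE_MAX)
        c->slot[c->count++] = stack;
    else
        free(stack);
}

/* Return all cached stacks to the system, e.g. at shutdown. */
void cache_drain(stack_cache_t *c)
{
    while (c->count > 0)
        free(c->slot[--c->count]);
}
```

With this policy, a workload that creates and destroys threads rapidly but keeps few alive at once pays for `malloc` only on the first few creations.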


3.1.5 Timing and Profiling 


Existing thread systems offer little or no support for 
thread timing and profiling. While most application- 
level programmers may not care about how many 
times a given thread is switched or what the total 
execution time for a thread is, most system-level pro- 
grammers do care about these measures. However, be- 
cause the state transitions that define these measures 
are hidden from the user, it is usually impossible to 
gather these statistics even if the user wants to. 

Our solution to this dilemma is to allow the user to 
take advantage of our exposed thread events and in- 
stall monitoring code that will be executed whenever 
a thread reaches one of these events. For example, to 
count the number of times that a thread is switched 
from runnable to active, we can simply install the fol- 
lowing function to be invoked whenever a thread trig- 
gers the restore event: 


void bump () { 
    ctxswCounter++; 
} 


3.2 Interfaces 


We now discuss a mechanism for making these dilem- 
mas available to the user in a clean and well-defined 
manner. Recall that in Figure 1, there are two in- 
terfaces presented to the user rather than just one. 
The first is the user-level interface, which defines the 
abstractions that are supported. The second is the 
meta-level interface, which defines how to change the 
behavior of the abstractions supported in the user-level 
interface. Thus, the meta-level interface represents 
the realization of our design dilemmas. We call this a 
meta-interface because it’s an interface that describes 
how another interface (the user-interface) should be- 
have. 


3.2.1 The User-level Interface 


The user-interface for OpenThreads provides a sim- 
ple and clean mechanism for creating threads, mutex 
variables, condition variables, and thread lists. No- 
ticeably absent are the plethora of routines that define 
the API for packages like pthreads, which is possible 
because the user-interface for OpenThreads is not con- 
cerned with modifying behavior. As with any thread 
package, OpenThreads allows the programmer to cre- 
ate new threads, yield, exit, wait for a mutex, and 
extern void ot_init (int *argc, char *argv[], char *pkg_prefix);
extern void ot_start (ot_queue_t *runq);
extern void ot_end (void);

typedef void (ot_userf_t) (void *p0);

extern void ot_thread_init (ot_thread_t *thread, ot_userf_t *start,
                            void *args, unsigned tid, ot_queue_t *runq);
extern void ot_thread_yield (void);
extern void ot_thread_yield_onto (ot_queue_t *destq);
extern void ot_thread_exit (void);
extern unsigned ot_thread_id (void);
extern ot_thread_t *ot_current_thread (void);
extern void ot_thread_setspecific (ot_thread_t *thread, void *ptr,
                                   void (*cleanup) (void *));
extern void *ot_thread_getspecific (ot_thread_t *thread);

extern void ot_queue_init (ot_queue_t *q);

extern void ot_mutex_init (ot_mutex_t *m, ot_queue_t *blockq);
extern void ot_mutex_lock (ot_mutex_t *m);
extern int ot_mutex_trylock (ot_mutex_t *m);
extern void ot_mutex_unlock (ot_mutex_t *m);

extern void ot_cond_init (ot_cv_t *cv, ot_queue_t *blockq,
                          ot_mutex_t *m);
extern void ot_cond_wait (ot_cv_t *cv);
extern void ot_cond_bcast (ot_cv_t *cv);





Figure 3: OpenThreads user interface 
block on a condition variable. These functions are detailed
in Figure 3. 

There are a few calls in this interface that bear spe- 
cial attention. First, the thread initialization func- 
tion used to create a thread, ot_thread_init, takes 
a thread list (queue) as an argument, and places 
the newly created thread on that list. This gives 
the programmer explicit control over managing run 
lists. Second, the ot_thread_yield_onto routine is 
used to specify a destination thread list upon which 
the existing thread will be put. Again, this gives 
the programmer explicit control over thread list man- 
agement. Third, the ot_thread_setspecific and 
ot_thread_getspecific calls are used to assign and 
retrieve a single generic pointer contained within the 
thread control block. This single pointer is meant 
to satisfy the needs of multithreaded runtime sys- 
tem designers, who all employ the concept of a task 
within their systems. Each task contains an instance 
of an OpenThread, which will perform the actual con- 
text switching between the tasks. However, since 
the current thread can only be defined in terms of 
an OpenThread, finding the current task requires a 
pointer from the OpenThread control block back to 
the surrounding task. Any additional thread-specific 
data can be multiplexed atop this single pointer with- 
out all users having to pay the cost in time and space 
to maintain an arbitrarily-long list of thread-specific 
pointers. 
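
The back-pointer arrangement described above can be sketched as follows. The types and functions here (`thread_t`, `task_t`, `thread_setspecific`, `thread_getspecific`, `task_of`) are simplified stand-ins written for this example, not the OpenThreads declarations; in particular, the cleanup argument of ot_thread_setspecific is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a thread control block holding one generic pointer. */
typedef struct { void *specific; } thread_t;

void thread_setspecific(thread_t *t, void *p)
{
    assert(t != NULL);
    t->specific = p;
}

void *thread_getspecific(thread_t *t)
{
    assert(t != NULL);
    return t->specific;
}

/* A runtime-system task embeds a thread and adds its own fields. */
typedef struct task {
    thread_t thread;     /* performs the actual context switching */
    int      task_id;
    void    *extra[4];   /* further per-task data, reached via the one pointer */
} task_t;

/* Initialize a task and point its thread back at the enclosing task. */
void task_init(task_t *task, int id)
{
    task->task_id = id;
    for (int i = 0; i < 4; i++)
        task->extra[i] = NULL;
    thread_setspecific(&task->thread, task);
}

/* Recover the surrounding task from a thread, e.g. the current one. */
task_t *task_of(thread_t *t)
{
    return (task_t *)thread_getspecific(t);
}
```

Any amount of per-task data can hang off `task_t`, so the thread control block itself never needs more than the single pointer.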

One problem with many thread packages is the 
vagueness with which multithreaded execution begins 
and ends, and what happens to the original thread of 
control within the process. OpenThreads makes these 
points of control explicit by creating two functions that 
mark the beginning and ending of multithreaded execution:
ot_begin_mt and ot_end_mt, respectively. In 
between these calls the original process thread of con- 
trol, now called the process thread, is allowed to exe- 
cute any other code it desires, and is treated just like 
any other thread in the system. It can, for example, be 
blocked on a condition variable or be re-scheduled for 
execution on the run list. When the ot_end_mt call re- 
turns, the system is single-threaded again and another 
round of multithreaded execution may be initiated if 
desired. There are also functions for initializing the 
OpenThreads package (ot_init) and cleaning up af- 
ter the package (ot_done). Sample code for a process 
initiating multithreaded execution is given in Figure 4. 


3.2.2 The Meta-level Interface 


The OpenThreads meta interface (Figure 5) pro- 
vides the hooks needed to customize the design dilem- 
mas listed in Section 3.1. In most cases, these deci- 
sions are set up as events that trigger user-specified 
actions, or callback functions, to occur. For example,
the otm_push_callback routine allows the user 
to specify a callback function to be invoked whenever 
the specified event is triggered. As mentioned in Section
3.1.1, events can be either thread-specific, which 
occur whenever any thread enters a specific state, or 
global, which occur when the system itself enters a 
new state. The valid thread-specific events are thread 
entry, thread exit, thread save, thread restore, thread 
blocking, and thread unblocking. The valid global 
events are system idle begin, system idle spin, and 
system idle end. Callback functions for each event 
are maintained in a stack, so that multiple functions 
can be associated with each event. For example, this 
would allow a set of tracing functions to be installed 
atop a set of timing functions. All functions for an event 
are called in stack order, and the otm_pop_callback 
routine provides a way of clearing or rearranging the 
callback stack of a thread at any time. 
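
A per-event callback stack of the kind just described could be organized as in the sketch below. The event identifiers, the fixed depth `CB_DEPTH`, and the helper `trigger` are assumptions for illustration; the text does not give the real values passed as cbid or the internal layout. The sketch also assumes that "stack order" means the most recently pushed callback runs first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event ids; the actual cbid values are not given in the text. */
enum { CB_ENTRY, CB_EXIT, CB_SAVE, CB_RESTORE, CB_BLOCKED, CB_UNBLOCKED, CB_NEVENTS };
#define CB_DEPTH 8   /* assumed maximum callbacks per event */

typedef void callback_t(void);

callback_t *cb_stack[CB_NEVENTS][CB_DEPTH];
int         cb_top[CB_NEVENTS];

/* Push a callback onto the stack for one event. */
void push_callback(int cbid, callback_t *f)
{
    assert(cbid >= 0 && cbid < CB_NEVENTS);
    assert(cb_top[cbid] < CB_DEPTH);
    cb_stack[cbid][cb_top[cbid]++] = f;
}

/* Pop and return the most recently pushed callback, or NULL if none. */
callback_t *pop_callback(int cbid)
{
    assert(cbid >= 0 && cbid < CB_NEVENTS);
    return cb_top[cbid] > 0 ? cb_stack[cbid][--cb_top[cbid]] : NULL;
}

/* Fire an event: run its callbacks, most recently pushed first. */
void trigger(int cbid)
{
    assert(cbid >= 0 && cbid < CB_NEVENTS);
    for (int i = cb_top[cbid] - 1; i >= 0; i--)
        cb_stack[cbid][i]();
}

/* Example monitoring callback: count runnable-to-active switches. */
int ctxswCounter;
void bump(void) { ctxswCounter++; }
```

Installing `bump` on the restore event and firing that event at each runnable-to-active transition yields the context switch count used as an example in Section 3.1.5.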

The otm_install_queue function allows the user to 
provide implementations for the get and put func- 
tions of a thread list, as well as to define how large 
the link components of the thread list need to be. 
OpenThreads will invoke the get and put functions 
of a list whenever a scheduling decision needs to be 
made. This allows for multiple, user-defined thread 
lists to be used at the same time within OpenThreads, 
giving the user total control over scheduling. The in- 
terface allows different implementations to be associ- 
ated with different thread lists at the same time. 

The otm_define_switch routine allows the user to 
define the thread switching mode for a given thread 
(or for all threads). The valid switching modes are all, 
integer, or none, referring to the registers to be saved. 


3.2.3 The System-level Interface 


One of the key elements in the design of a multi- 
threaded runtime system is portability. Since parallel 
and concurrent languages execute in a wide variety of 
environments, their runtime systems must support a 
wide range of platforms. 

To enable portability, we added a system-level in- 
terface (see Figure 6) to the traditional Open Imple- 
mentation design, resulting in the overall design of 
OpenThreads with three interfaces, as depicted in Fig- 
ure 7. The system interface provides a single place for 
mapping all dependencies that cannot be satisfied from 
within the OpenThreads implementation. 

These routines are then mapped (usually with sym- 
bolic constants) onto platform-specific routines that 
provide the necessary functionality. Therefore, a suc- 
cessful port of OpenThreads requires modification of 
exactly one header file in a clean and well-defined 
way. Note that the routines at this level are decid- 
edly low-level, so that additional overheads are not in- 
curred. The thread initialization and context switch- 
void main (int argc, char *argv[])
{
    ot_thread_t threads[NT];
    ot_queue_t runq;

    ot_init (&argc, argv, "ot_");
    ot_queue_init (&runq);
    for (int i = 0; i < NT; i++)
        ot_thread_init (&threads[i], entry_func, arg, &runq);

    ot_begin_mt (&runq);
    /* == begin multithreaded execution == */

    ot_end_mt ();
    /* == end multithreaded execution == */

    ot_done ();
}





Figure 4: Sample code for initiating multithreaded execution 


void otm_init (void); 

void otm_install_stackalloc (otm_sallocf_t *salloc, 
otm_sfreef_t *sfree); 

void otm_define_switch (otm_switch_mode_t, ot_thread_t *t); 

void otm_push_callback (int cbid, otm_callback_t *cbfunc) ; 

otm_callback_t *otm_pop_callback (int cbid) ; 

void otm_install_queue (ot_queue_t *q, unsigned qlink_size, 
unsigned qimp_size, otm_qinitf_t *init, 
otm_qgetf_t *get, otm_qputf_t *put) ; 





Figure 5: OpenThreads meta interface 


extern void *ots_stack_align (void *stack);
extern ots_stack_t *ots_stack_pointer (void *storage, int size);
extern ots_stack_t *ots_stack_init (ots_stack_t *sp, ots_userf_t *userf,
                                    void *userarg, ots_inif_t *initf, void *initarg);
extern void *ots_switch_all (ots_helperf_t *helper, void *arg1,
                             void *arg2, ots_stack_t *new);
extern void *ots_lock_init (ots_lock_t lock);
extern int ots_lock_acquire (ots_lock_t lock);
extern void ots_lock_release (ots_lock_t lock);





Figure 6: OpenThreads system interface 
Figure 7: OpenThreads design 


ing routines intentionally mimic the QuickThreads [17] 
macros, which are an excellent example of how very- 
low-level thread support can be provided in a machine- 
independent manner. 


4 Discussion 


In this section we discuss our design in terms of our 
original design goals and in terms of open issues that 
have yet to be addressed or resolved. 


4.1 Design Goals 


The design of OpenThreads is based on three essential 
goals: flexibility, efficiency, and portability. Our hy- 
pothesis is that substrate software requires success for 
each goal, and that existing lightweight thread pack- 
ages fail in one or more of these goals. For exam- 
ple, one may argue that pthreads [16] is efficient and 
provides portability because it’s the “standard,” but 
it falls short in providing the flexibility demanded by 
most system-level programmers. For example, using 
pthreads it is impossible to trace thread execution 
when the scheduling policy is round-robin (preemp- 
tive). 

We now examine our Open Implementation of 
lightweight threads with respect to these design goals: 


1. Flexibility. 


The flexibility of our design is manifest by the 
meta-level interface, and the ability of a program- 
mer to provide her own solutions to the design 
dilemmas outlined in Section 3.1. As a concrete 
example of its flexibility, we have adapted the 
Panda multithreaded runtime system [7] to use 
OpenThreads for implementing its tasks. 


2. Efficiency. 


With regard to performance, we can report on 
the performance of OpenThreads in comparison with
POSIX Pthreads and the QuickThreads package 
upon which OpenThreads is built. The compar- 
ison with Pthreads demonstrates the efficiency 
of OpenThreads for doing basic multithreading. 
The comparison with QuickThreads shows how 
much overhead we’ve added to the underlying low- 
level switching routines. As a demonstration of 
portability, we present these tests on three differ- 
ent processor architectures: the MIPS R4400, the 
Sun SPARC, and the DEC ALPHA. The mea-
surements are given in Table 1. All numbers were
gathered using averages for multiple runs, with an 
initial untimed run used to reduce cache miss ef- 
fects. For context switches, OpenThreads saved 
all registers. 


A second set of experiments evaluates the flexi-
bility of OpenThreads to support the Panda run-
time system. The hand-crafted Panda thread
package was developed for the Amoeba dis-
tributed operating system and runs on 50 MHz
SPARC processors with 32 Mbyte of memory and 
4 Kbyte instruction and 2 Kbyte data caches (di- 
rect mapped). To achieve high performance the 
Panda scheduler is not built on top of Amoeba’s 
kernel threads, but it was derived from a user-level 
thread package developed at MIT by Wallach and 
Kaashoek. These results are presented in Table 2. 


Given the extensive tuning performed on the
Panda thread package and the overhead of
OpenThreads, we expected the generic Panda
threads implemented with OpenThreads (Panda-
OT) to perform worse than the hand-crafted Panda
threads. The performance results presented in
Table 2, however, show that Panda-OT has the
faster context switch time on Amoeba: 40.2 µs
(Panda) versus 35.4 µs (Panda-OT). Examina-
tion of the low-level context switch code pro-
vided by QuickThreads (used in OpenThreads)
revealed that it contains a SPARC specific opti- 


1997 Annual Technical Conference 251 














[Table 1 body not recoverable from the scan. Surviving labels: rows QuickThreads and Pthreads; column group MIPS R4400 with columns ctxsw and create.]

Table 1: OpenThreads raw performance on various architectures; ctxsw is the time in microseconds for a context
switch and create is the time in microseconds to create a single thread with default stack size (usually 8 Kb).


Table 2: Performance of Panda thread packages on various platforms.


mization that saves one register-window under- 
flow trap. Saving one trap to the Amoeba ker- 
nel accounts for approximately 7 us. Since we 
did not want to change Panda’s implementation 
for the sake of making this comparison, we have 
provided the empirical numbers instead. How- 
ever, the same optimization could be applied to 
the Panda threads, which would negate the per- 
formance advantage of OpenThreads for this ex- 
periment. In the end, we see about a 6% overhead 
for OpenThreads, which is a small price to pay for 
the added flexibility and portability. 


Unlike the context switch results, the results for 
thread creation show that Panda-OT performs 
poorly in comparison to native Panda; creating 
a thread with Panda-OT is more than twice as 
expensive as with native Panda. This large over- 
head, however, is not a consequence of the flex- 
ibility of OpenThreads, but of a slight differ- 
ence in scheduling policy of both thread packages. 
Panda-OT implements true priority scheduling, 
and therefore takes a scheduling decision as soon 
as the new thread is created and placed on the 
run queue. This results in a context switch to the 
newly created thread, which initializes itself and 
then terminates immediately. Once the thread 
has exited, another context switch occurs to trans- 
fer control back to the main thread. The Panda 
scheduler, on the other hand, does not take a 
scheduling decision until the main thread volun- 
tarily yields control. The difference in scheduling 
policy causes Panda-OT to incur two additional 
context switches, thus accounting for the perfor- 
mance difference. 
Table 2 also contains performance numbers for
Panda-OT on Solaris. In this case we compare 
Panda-OT to the original Panda implementation 
that uses native Solaris threads. The performance 
was measured on a 70 MHz SPARCstation 4 with
64 Mb of memory running Solaris 2.4. Panda-OT
performs much better than native Panda: Panda- 
OT switches between threads over four times as 
fast and creates new threads over six times as fast. 
Since we do not have the source code of Solaris 
threads available, it is difficult to determine the 
exact causes of this great difference. We believe, 
however, that part of the explanation is Solaris’
ability to support a mixture of kernel and user 
threads. This functionality adds considerably to 
the complexity of the threads system, and requires 
expensive precautions to guard against preemp- 
tion at various levels. 


3. Portability.


Our system-level interface isolates all underlying 
dependencies in a single file that needs to be 
changed for each new platform. In some cases, 
there are no changes required at all. This is due 
to the fact that OpenThreads is currently map- 
ping its low-level thread operations onto Quick- 
Threads [17], which is already very portable. The 
Panda port runs the same code on both the So- 
laris and Amoeba platforms, and regular Orca 
programs were run to ensure the stability of the 
port. 


OpenThreads has been tested on the Sun Su- 
perSPARC and UltraSPARC architectures under 
both SunOS 4.1 and Solaris 2.5, the MIPS R4400 
architecture under IRIX 5.3, and the Alpha AXP 




architecture under DEC OSF-1. In addition, 
QuickThreads runs on the Intel 80x86, the Mo- 
torola 88000, the HP-PA, the KSR, and the VAX. 
As a result, OpenThreads can easily be ported to 
these architectures. 
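
The thread-creation result discussed under Efficiency above can be illustrated with a small counting model. This is only a sketch under stated assumptions: the names policy_t, sim_create, and sim_yield are hypothetical and belong to neither OpenThreads nor Panda, and every created thread is assumed to initialize itself and exit at once, as in the benchmark described in the text.

```c
#include <assert.h>

/* Hypothetical policy names for this sketch; not part of OpenThreads or Panda. */
typedef enum { POLICY_EAGER, POLICY_DEFERRED } policy_t;

typedef struct {
    policy_t policy;
    int pending;            /* created threads that have not run yet */
    long context_switches;  /* switches counted so far */
} sim_t;

static void sim_init(sim_t *s, policy_t policy) {
    s->policy = policy;
    s->pending = 0;
    s->context_switches = 0;
}

/* Create a thread whose body initializes itself and exits at once. */
static void sim_create(sim_t *s) {
    s->pending++;
    if (s->policy == POLICY_EAGER) {
        /* Scheduling decision taken now: switch to the new thread,
           then switch back to the creator when it exits. */
        s->context_switches += 2;
        s->pending--;
    }
}

/* The creator voluntarily yields: each pending thread runs and exits in
   turn, and control finally returns to the creator. */
static void sim_yield(sim_t *s) {
    if (s->pending > 0) {
        s->context_switches += s->pending + 1;
        s->pending = 0;
    }
    assert(s->pending == 0);
}

/* Switches charged to the creation loop itself, before any yield. */
static long switches_during_creation(policy_t policy, int nthreads) {
    sim_t s;
    sim_init(&s, policy);
    for (int i = 0; i < nthreads; i++)
        sim_create(&s);
    return s.context_switches;
}
```

In this model the eager policy charges two switches to every creation, while the deferred policy charges none until the creator yields. That matches the explanation in the text, although the model says nothing about the absolute times in Table 2.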


4.2 Open Issues 


There are still a few open issues that remain in the 
design and implementation of OpenThreads. 


1. Thread identification is the task of determining 
which thread, or more specifically, which thread 
control block is currently active at any given time. 
One way to do this is to reserve a global register 
to hold this pointer. This is the approach taken 
by Solaris threads [28], and has the advantage of 
being very fast and not requiring global memory. 
However, compiler and architecture support are 
required to reserve this register, and the loss of 
a register on RISC-based architectures is always 
cause for concern. Another approach is to keep 
the current thread pointer on a well-known offset 
in each thread stack [6]. This approach eliminates 
the need for both global memory and register 
space, but requires fancy stack alignments. An- 
other approach, and the one currently employed 
by OpenThreads, is to use a global variable for 
storing the current thread pointer. This approach 
is the easiest to implement, does not require com- 
piler support for extra registers, and does not re- 
quire fancy stack manipulation. However, it does 
require per-processor global memory. A formal in- 
vestigation regarding the best method for thread 
identification is still an open issue. 


2. Kernel threads represent an opportunity to in- 
crease processor utilization in the face of blocking 
kernel calls, such as I/O. However, multiplexing 
user-level threads atop kernel threads requires a 
little finesse. We are in the process of revising 
OpenThreads to be safe in the presence of ker- 
nel threads. This requires protecting critical re- 
gions with kernel locks and identifying all global 
data as either shared among the kernel threads 
or private. Kernel thread private global data, 
such as the pointer to the current thread, needs 
to be stored as thread-specific data for each ker-
nel thread. Powell et al. [24] provide a discus-
sion of this mapping for Solaris threads mapped
onto LWPs. Another issue to be addressed in
supporting kernel threads is the impact of ker-
nel thread decisions on the meta-level interface.
For example, some kernel threads might support 
features that allow for better optimization of the 
user threads, such as upcalls. To what degree 


should these decisions be supported in the meta- 
interface? Kernel threads also raise the specter of 
multiprocessor issues, which we have completely 
ignored to this point. 


3. Multithreaded runtime systems require both 
thread and communication components, and both 
must work well together. We are in the process of 
examining the issues regarding the combination 
of threads with communication models, and plan 
to test the flexibility of our system for performing 
platform-independent optimizations using a com- 
bination of thread and communication modules. 
Other systems combining threads with communi-
cation primitives include Chant [14], Nexus [12],
PVM-threads [21], and MPI-threads [10]. 


4. Signals. Most papers on lightweight threads in- 
clude long and involved discussions about signal 
handling in the presence of threads. We will avoid 
this discussion for now, and simply state that 
OpenThreads currently exposes all threads to the 
same signal mask. We do, however, ensure that if 
a signal arrives while the system is in an atomic 
section, the signal handler is delayed until after 
the atomic section has been completed. 


5. Debugging. Debugging multithreaded systems
has always been a trying experience because 
there are almost no debuggers that recognize the 
threads. OpenThreads does provide critical sup- 
port in this regard by allowing traces to be made 
for thread state transitions by installing print 
statements at the various thread-specific event 
points. However, more sophisticated debugging 
tools are clearly required. 
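
Two of the thread-identification approaches listed in item 1 above can be sketched side by side. This is a minimal illustration, not OpenThreads code: tcb_t, STACK_SIZE, stack_create, and tcb_from_stack_address are hypothetical names, the "well-known offset" is assumed to be offset zero of a stack aligned to its own power-of-two size, and the pointer masking relies on the flat address model of common platforms.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define STACK_SIZE ((size_t)8192)  /* assumed: a power of two */

/* Hypothetical thread control block for this sketch. */
typedef struct tcb { int id; } tcb_t;

/* Global-variable approach: one pointer, updated on every context switch.
   With several kernel threads it would have to become per-processor data. */
static tcb_t *current_thread;

/* Stack-offset approach: allocate the stack aligned to its own size and
   store the owner's TCB pointer at the lowest address of the stack. */
static void *stack_create(tcb_t *owner) {
    void *base = aligned_alloc(STACK_SIZE, STACK_SIZE);
    if (base != NULL)
        *(tcb_t **)base = owner;
    return base;
}

/* Recover the TCB from any address inside the stack (for example the
   address of a local variable) by masking off the low-order bits. */
static tcb_t *tcb_from_stack_address(const void *addr) {
    uintptr_t base = (uintptr_t)addr & ~((uintptr_t)STACK_SIZE - 1);
    return *(tcb_t **)base;
}
```

The masking step is the "fancy stack alignment" cost mentioned in the text: every stack must be allocated at an aligned address, whereas the global variable needs no such care but occupies global memory.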


5 Related Research 


OpenThreads represents a novel approach to the de- 
sign of user-level threads, in which the user is given 
the opportunity to change the behavior of high-level 
abstractions in a well-defined manner. Many thread 
packages, such as pthreads [16], support an exten-
sive user-interface with some behavior-modifying com-
mands intertwined (such as attribute specification for
threads). However, these systems do not take a sys- 
tematic approach to exposing the critical design dilem- 
mas and, as a result, fall short in providing the flexi- 
bility required by most system-level programmers. 
QuickThreads [17] is a thread-building toolkit that 
offers platform independent micro-instructions for 
managing thread stacks. QuickThreads is similar to 
assembly language programming in terms of flexibil- 
ity, speed, and complexity. OpenThreads builds on 
the QuickThreads design philosophy of keeping things 
simple, and provides high-level abstractions whose be-
havior can be modified by the user in a well-defined
manner. 

The initial description of Open Implementation 
Analysis and Design [18] provided the motivation for 
much of this work. However, the initial description 
fails to talk about portability concerns. As a result, 
we extended the design to include a new system-level 
interface that unifies and defines all system dependen- 
cies. 


6 Conclusions 


It would seem that the last thing we need these days 
is another user-level thread package. From the stand- 
point of an application-level programmer I would have 
to agree. However, from the standpoint of a system- 
level programmer building multithreaded runtime sys- 
tems, I would disagree. The evidence suggests that 
none of the current thread packages are being widely 
used by system-level programmers. In this paper we 
introduce the design of a user-level thread package for 
substrate software. The idea here is to identify all of 
the crucial design dilemmas that occur in building a 
thread package, and provide a clean and well-defined 
way for users to change these decisions. The result is 
a thread package with a simple user interface and a 
powerful meta interface for changing the behavior of 
the abstractions defined by the user interface. A sys- 
tem interface should also be used to isolate and define 
all underlying dependencies. 

We have designed and built OpenThreads as a proof- 
of-concept for the ideas outlined in this paper, and 
are in the process of adapting several multithreaded 
runtime systems to use OpenThreads. We will report 
on the success of these attempts in the future. 
Abstract


Modern switched networks such as ATM and Myri- 
net enable low-latency, high-bandwidth communica- 
tion. This performance has not been realized by cur- 
rent applications, because of the high processing over- 
heads imposed by existing communications software. 
These overheads are usually not hidden with large 
packets; most network traffic is small. We have devel- 
oped Fast Sockets, a local-area communication layer 
that utilizes a high-performance protocol and exports 
the Berkeley Sockets programming interface. Fast 
Sockets realizes round-trip transfer times of 60 mi- 
croseconds and maximum transfer bandwidth of 33 
MB/second between two UltraSPARC Is connected 
by a Myrinet network. Fast Sockets obtains perfor- 
mance by collapsing protocol layers, using simple 
buffer management strategies, and utilizing knowl- 
edge of packet destinations for direct transfer into user 
buffers. Using receive posting, we make the Sockets 
API a single-copy communications layer and enable 
regular Sockets programs to exploit the performance 
of modern networks. Fast Sockets transparently re- 
verts to standard TCP/IP protocols for wide-area com- 
munication. 
1 Introduction


The development and deployment of high perfor-
mance local-area networks such as ATM [de Prycker
1993], Myrinet [Seitz 1994], and switched high-
speed Ethernet have the potential to dramatically im-
prove communication performance for network ap-
plications. These networks are capable of microsec-
ond latencies and bandwidths of hundreds of megabits
per second; their switched fabrics eliminate the con-
tention seen on shared-bus networks such as tradi-
tional Ethernet. Unfortunately, this raw network ca-
pacity goes unused, due to current network commu-
nication software.


Most of these networks run the TCP/IP proto-
col suite [Postel 1981b, Postel 1981c, Postel 1981a].
TCP/IP is the default protocol suite for Internet traf- 
fic, and provides inter-operability among a wide va- 
riety of computing platforms and network technolo- 
gies. The TCP protocol provides the abstraction of 
a reliable, ordered byte stream. The UDP protocol 
provides an unreliable, unordered datagram service. 
Many other application-level protocols (such as the 
FTP file transfer protocol, the Sun Network File Sys- 
tem, and the X Window System) are built upon these 
two basic protocols. 


TCP/IP’s observed performance has not scaled to 
the ability of modern network hardware, however. 
While TCP is capable of sustained bandwidth close 
to the rated maximum of modern networks, actual 
bandwidth is very much implementation-dependent. 
The round-trip latency of commercial TCP implemen-
tations is hundreds of microseconds higher than the
minimum possible on these networks. Implementa- 
tions of the simpler UDP protocol, which lack the reli- 
ability and ordering mechanisms of TCP, perform lit- 
tle better [Keeton et al. 1995, Kay & Pasquale 1993, 
von Eicken et al. 1995]. This poor performance is 
due to the high per-packet processing costs (process- 
ing overhead) of the protocol implementations. In 
local-area environments, where on-the-wire times are
small, these processing costs dominate small-packet 
round-trip latencies. 

Local-area traffic patterns exacerbate the problems 
posed by high processing costs. Most LAN traf- 
fic consists of small packets [Kleinrock & Naylor 
1974, Schoch & Hupp 1980, Feldmeier 1986, Amer 
et al. 1987, Cheriton & Williamson 1987, Gusella 
1990, Caceres et al. 1991, Claffy et al. 1992].
Small packets are the rule even in applications con- 
sidered bandwidth-intensive: 95% of all packets in a 
NFS trace performed at the Berkeley Computer Sci- 
ence Division carried less than 192 bytes of user data 
[Dahlin et al. 1994]; the mean packet size in the trace
was 382 bytes. Processing overhead is the dominant 
transport cost for packets this small, limiting NFS per- 
formance on a high-bandwidth network. This is true 
for other applications: most application-level proto- 
cols in local-area use today (X11, NFS, FTP, etc.) op- 
erate in a request-response, or client-server, manner: 
a client machine sends a small request message to a 
server, and awaits a response from the server. In the 
request-response model, processing overhead usually
cannot be hidden through packet pipelining or through 
overlapping communication and computation, mak- 
ing round-trip latency a critical factor in protocol per- 
formance. 

Traditionally, there are several methods of attack- 
ing the processing overhead problem: changing the 
application programming interface (API), changing 
the underlying network protocol, changing the im- 
plementation of the protocol, or some combination 
of these approaches. Changing the API modifies the
code used by applications to access communications 
functionality. While this approach may yield bet- 
ter performance for new applications, legacy appli- 
cations must be re-implemented to gain any benefit. 
Changing the communications protocol changes the 
“on-the-wire” format of data and the actions taken 
during a communications exchange — for example, 
modifying the TCP packet format. A new or modified 
protocol may improve communications performance, 
but at the price of incompatibility: applications com- 
municating via the new protocol are unable to share 
data directly with applications using the old protocol. 
Changing the protocol implementation rewrites the 
software that implements a particular protocol; packet 
formats and protocol actions do not change, but the 
code that performs these actions does. While this ap- 
proach provides full compatibility with existing pro- 
tocols, fundamental limitations of the protocol design 
may limit the performance gain. 

Recent systems, such as Active Messages [von 
Eicken et al. 1992], Remote Queues [Brewer et al. 
1995], and native U-Net [von Eicken et al. 1995],
have used the first two methods; they implement new 
protocols and new programming interfaces to obtain 
improved local-area network performance. The pro- 
tocols and interfaces are lightweight and provide pro- 
gramming abstractions that are similar to the under- 
lying hardware. All of these systems realize laten- 
cies and throughput close to the physical limits of the 
network. However, none of them offer compatibility 
with existing applications. 

Other work has tried to improve performance by re-
implementing TCP. Recent work includes zero-copy
TCP for Solaris [Chu 1996] and a TCP interface for 
the U-Net interface [von Eicken et al. 1995]. These 
implementations can inter-operate with other TCP/IP 
implementations and improve throughput and latency 
relative to standard TCP/IP stacks. Both implementa- 
tions can realize the full bandwidth of the network for 
large packets. However, both systems have round-trip 
latencies considerably higher than the raw network. 

This paper presents our solution to the overhead 
problem: a new communications protocol and im- 
plementation for local-area networks that exports the 
Berkeley Sockets API, uses a low-overhead protocol 
for local-area use, and reverts to standard protocols 
for wide-area communication. The Sockets API is 
a widely used programming interface that treats net- 
work connections as files; application programs read 
and write network connections exactly as they read 
and write files. The Fast Sockets protocol has been 
designed and implemented to obtain a low-overhead 
data transmission/reception path. Should a Fast Sock- 
ets program attempt to connect with a program out- 
side the local-area network, or to a non-Fast Sockets 
program, the software transparently reverts to stan- 
dard TCP/IP sockets. These features enable high- 
performance communication through relinking exist- 
ing application programs. 
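
The point that Sockets programs read and write connections exactly as they do files can be shown with standard POSIX calls. The sketch below uses a local socketpair() rather than Fast Sockets itself, and echo_through_socket is a hypothetical helper name introduced only for illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Write msg into one end of a connected stream socket pair and read it
   back from the other end using plain write()/read(), the same calls
   used on files. Returns the number of bytes read, or -1 on failure. */
static ssize_t echo_through_socket(const char *msg, char *out, size_t outlen) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    size_t len = strlen(msg);
    ssize_t got = -1;
    if (write(sv[0], msg, len) == (ssize_t)len)
        got = read(sv[1], out, outlen);

    close(sv[0]);
    close(sv[1]);
    return got;
}
```

Because the application only sees file descriptors and read()/write(), a library that implements those entry points with a different transport can be substituted at link time, which is the property the paper relies on.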

Fast Sockets achieves its performance through a 
number of strategies. It uses a lightweight protocol 
and efficient buffer management to minimize book- 
keeping costs. The communication protocol and 
its programming interface are integrated to elimi- 
nate module-crossing costs. Fast Sockets eliminates 
copies within the protocol stack by using knowledge 
of packet memory destinations. Additionally, Fast 
Sockets was implemented without modifications to 
the operating system kernel.

A major portion of Fast Sockets’ performance is 
due to receive posting, a technique of utilizing infor- 
mation from the API about packet destinations to min- 
imize copies. This paper describes the use of receive 
posting in a high-performance communications stack. 
It also describes the design and implementation of a 
low-overhead protocol for local-area communication.

The rest of the paper is organized as follows. Sec- 
tion 2 describes problems of current TCP/IP and 
Sockets implementations and how these problems af- 
fect communication performance. Section 3 describes 
how the Fast Sockets design attempts to overcome 
these problems. Section 4 describes the performance 
of the resultant system, and section 5 compares Fast 
Sockets to other attempts to improve communication 
performance. Section 6 presents our conclusions and 
directions for future work in this area. 


2 Problems with TCP/IP 


While TCP/IP can achieve good throughput on cur- 
rently deployed networks, its round-trip latency is 
usually poor. Further, observed bandwidth and round- 
trip latencies on next-generation network technolo- 
gies such as Myrinet and ATM do not begin to ap- 
proach the raw capabilities of these networks [Keeton 
et al. 1995]. In this section, we describe a number of 
features and problems of commercial TCP implemen- 
tations, and how these features affect communication 
performance. 


2.1 Built for the Wide Area 


TCP/IP was originally designed, and is usually im- 
plemented, for wide-area networks. While TCP/IP 
is usable on a local-area network, it is not optimized 
for this domain. For example, TCP uses an in-packet 
checksum for end-to-end reliability, despite the pres- 
ence of per-packet CRC’s in most modern network 
hardware. But computing this checksum is expensive, 
creating a bottleneck in packet processing. IP uses 
header fields such as ‘Time-To-Live’ which are only 
relevant in a wide-area environment. IP also supports 
internetwork routing and in-flight packet fragmenta- 
tion and reassembly, features which are not useful in 
a local-area environment. The TCP/IP model assumes
communication between autonomous machines that 
cooperate only minimally. However, machines on a 
local-area network frequently share a common admin-
istrative service, a common file system, and a com-
mon user base. It should be possible to extend this
commonality and cooperation into the network com- 
munication software. 
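
To make the cost of the in-packet checksum concrete, here is a sketch of the 16-bit one's-complement Internet checksum in the style described by RFC 1071. The function name internet_checksum is ours, and the code is a plain reference version rather than the optimized routine of any particular TCP stack. Note that it must touch every byte of the packet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of 16-bit big-endian words, folded to 16 bits and
   complemented. A trailing odd byte is padded with a zero byte. */
static uint16_t internet_checksum(const uint8_t *data, size_t len) {
    uint64_t sum = 0;

    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```

A receiver verifies a packet by running the same routine over the data with the transmitted checksum included; a result of zero means the sums agree. On a network whose hardware already applies a per-packet CRC, this pass over the data repeats protection the hardware already supplies on each link.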


2.2 Multiple Layers 


Standard implementations of the Sockets interface 
and the TCP/IP protocol suite separate the protocol 
and interface stack into multiple layers. The Sockets 


interface is usually the topmost layer, sitting above the 
protocol. The protocol layer may contain sub-layers: 
for example, the TCP protocol code sits above the 
IP protocol code. Below the protocol layer is the in- 
terface layer, which communicates with the network 
hardware. The interface layer usually has two por- 
tions, the network programming interface, which pre- 
pares outgoing data packets, and the network device 
driver, which transfers data to and from the network 
interface card (NIC). 

This multi-layer organization enables proto- 
col stacks to be built from many combinations of 
protocols, programming interfaces, and network 
devices, but this flexibility comes at the price of 
performance. Layer transitions can be costly in 
time and programming effort. Each layer may use 
a different abstraction for data storage and transfer, 
requiring data transformation at every layer bound- 
ary. Layering also restricts information transfer. 
Hidden implementation details of each layer can 
cause large, unforeseen impacts on performance 
[Clark 1982, Crowcroft et al. 1992]. Mechanisms
have been proposed to overcome these difficulties
[Clark & Tennenhouse 1990], but existing work has
focused on message throughput, rather than protocol 
latency [Abbott & Peterson 1993]. Also, the number 
of programming interfaces and protocols is small: 
there are two programming interfaces (Berkeley 
Sockets and the System V Transport Layer Interface) 
and only a few data transfer protocols (TCP/IP 
and UDP/IP) in widespread usage. This paucity of 
distinct layer combinations means that the generality 
of the multi-layer organization is wasted. Reducing 
the number of layers traversed in the communications 
stack should reduce or eliminate these layering costs 
for the common case of data transfer. 


2.3 Complicated Memory Management 


Current TCP/IP implementations use a complicated 
memory management mechanism. This system ex- 
ists for a number of reasons. First, a multi-layered 
protocol stack means packet headers are added (or re- 
moved) as the packet moves downward (or upward) 
through the stack. This should be done easily and ef- 
ficiently, without excessive copying. Second, buffer 
memory inside the operating system kernel is a scarce 
resource; it must be managed in a space-efficient fash- 
ion. This is especially true for older systems with lim- 
ited physical memory. 

To meet these two requirements, mechanisms such 
as the Berkeley Unix mbuf have been used. An mbuf
can directly hold a small amount of data, and mbufs
can be chained to manage larger data sets. Chain- 




ing makes adding and removing packet headers easy. 
The mbuf abstraction is not cheap, however: 15%
of the processing time for small TCP packets is con-
sumed by mbuf management [Kay & Pasquale 1993].
Additionally, to take advantage of the mbuf abstrac- 
tion, user data must be copied into and out of mbufs, 
which consumes even more time in the data transfer 
critical path. This copying means that nearly one- 
quarter of the small-packet processing time in a com- 
mercial TCP/IP stack is spent on memory manage- 
ment issues. Reducing the overhead of memory man- 
agement is therefore critical to improving communi- 
cations performance. 
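
The chaining idea can be sketched with a much simplified buffer chain. This is not the Berkeley mbuf structure: seg_t, SEG_CAPACITY, and the chain_* functions are hypothetical names, and the fixed 64-byte segment size is an assumption made for the example. It shows why headers are cheap to add and remove, and where the extra copies of user data arise.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SEG_CAPACITY 64  /* assumed small fixed payload per segment */

typedef struct seg {
    struct seg *next;
    size_t len;
    unsigned char data[SEG_CAPACITY];
} seg_t;

/* Allocate a segment holding a copy of src, linked in front of next.
   This copy is the cost paid for moving user data into the chain. */
static seg_t *seg_new(const void *src, size_t len, seg_t *next) {
    if (len > SEG_CAPACITY)
        return NULL;
    seg_t *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    memcpy(s->data, src, len);
    s->len = len;
    s->next = next;
    return s;
}

/* Adding a header on the way down the stack: link a new head segment;
   the payload already in the chain is not touched. */
static seg_t *chain_prepend(seg_t *chain, const void *hdr, size_t len) {
    return seg_new(hdr, len, chain);
}

/* Removing a header on the way up the stack: unlink the head segment. */
static seg_t *chain_strip(seg_t *chain) {
    seg_t *rest = chain->next;
    free(chain);
    return rest;
}

/* Copy the chain contents out into a contiguous buffer (the second copy).
   Returns the number of bytes written, or 0 if out is too small. */
static size_t chain_flatten(const seg_t *chain, unsigned char *out, size_t cap) {
    size_t n = 0;
    for (; chain != NULL; chain = chain->next) {
        if (n + chain->len > cap)
            return 0;
        memcpy(out + n, chain->data, chain->len);
        n += chain->len;
    }
    return n;
}

static void chain_free(seg_t *chain) {
    while (chain != NULL) {
        seg_t *next = chain->next;
        free(chain);
        chain = next;
    }
}
```

Each header operation is a pointer update, but user data is copied once on entry (seg_new) and once on exit (chain_flatten), and every segment costs an allocation. These are the kinds of bookkeeping costs the measurements cited above attribute to memory management.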


3 Fast Sockets Design 


Fast Sockets is an implementation of the Sock- 
ets API that provides high-performance communi- 
cation and inter-operability with existing programs. 
It yields high-performance communication through 
a low-overhead protocol layered on top of a low- 
overhead transport mechanism (Active Messages). 
Interoperability with existing programs is obtained by 
supporting most of the Sockets API and transparently 
using existing protocols for communication with non- 
Fast Sockets programs. In this section, we describe 
the design decisions and consequent trade-offs of Fast 
Sockets. 


3.1 Built For The Local Area 


Fast Sockets is targeted at local-area networks of 
workstations, where processing overhead is the pri- 
mary limitation on communications performance. For 
low-overhead access to the network with a defined, 
portable interface, Fast Sockets uses Active Messages 
[von Eicken et al. 1992, Martin 1994, Culler et al. 
1994, Mainwaring & Culler 1995]. An active mes-
sage is a network packet which contains the name of
a handler function and data for that handler. When an 
active message arrives at its destination, the handler 
is looked up and invoked with the data carried in the 
message. While conceptually similar to a remote pro- 
cedure call [Birrell & Nelson 1984], an active mes- 
sage is constrained in the amount and types of data 
that can be carried and passed to handler functions. 
These constraints enable the structuring of an Active 
Messages layer for high performance. Also, Active 
Messages uses protected user-level access to the net- 
work interface, removing the operating system kernel 
from the critical path. Active messages are reliable, 
but not ordered. 
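
The handler mechanism described above can be sketched as a table lookup followed by a call. The types and functions here (am_packet_t, am_register, am_deliver, endpoint_t) are hypothetical and do not reproduce the real Active Messages interface; the eight word-sized arguments follow the figure quoted later in this section, and the two sample handlers are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AM_MAX_ARGS     8   /* word-sized arguments carried per message */
#define AM_MAX_HANDLERS 16  /* assumed table size for this sketch */

typedef struct {
    unsigned handler;            /* names the handler to run on arrival */
    uint32_t args[AM_MAX_ARGS];  /* small metadata, e.g. protocol headers */
    const void *payload;         /* bulk data, if any */
    size_t payload_len;
} am_packet_t;

typedef void (*am_handler_t)(const am_packet_t *pkt, void *state);

static am_handler_t handler_table[AM_MAX_HANDLERS];

static int am_register(unsigned index, am_handler_t fn) {
    if (index >= AM_MAX_HANDLERS)
        return -1;
    handler_table[index] = fn;
    return 0;
}

/* On arrival, look up the named handler and invoke it with the message. */
static int am_deliver(const am_packet_t *pkt, void *state) {
    if (pkt->handler >= AM_MAX_HANDLERS || handler_table[pkt->handler] == NULL)
        return -1;
    handler_table[pkt->handler](pkt, state);
    return 0;
}

/* Example receiver state and two handlers: one for the common data path,
   one for a rare control event. */
typedef struct { size_t bytes_received; int connected; } endpoint_t;

static void on_data(const am_packet_t *pkt, void *state) {
    ((endpoint_t *)state)->bytes_received += pkt->payload_len;
}

static void on_connect(const am_packet_t *pkt, void *state) {
    (void)pkt;
    ((endpoint_t *)state)->connected = 1;
}
```

Because the sender selects the handler, the data path never has to test for control events such as connection setup; each kind of event simply arrives at its own function.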

Using Active Messages as a network transport in-
volves a number of trade-offs. Active Messages has
its own ‘on-the-wire’ packet format; this makes a full 
implementation of TCP/IP on Active Messages infea- 
sible, as the transmitted packets will not be compre- 
hensible by other TCP/IP stacks. Instead, we elected 
to implement our own protocol for local area com- 
munication, and fall back to normal TCP/IP for wide- 
area communication. Active Messages operates pri- 
marily at user-level; although access to the network 
device is granted and revoked by the operating system 
kernel, data transmission and reception is performed 
by user-level code. For maximum performance, Fast 
Sockets is written as a user-level library. While this 
organization avoids user-kernel transitions on com- 
munications events (data transmission and reception), 
it makes the maintenance of shared and global state, 
such as the TCP and UDP port name spaces, difficult. 
Some global state can be maintained by simply using 
existing facilities. For example, the port name spaces 
can use the in-kernel name management functions. 
Other shared or global state can be maintained by us- 
ing a server process to store this state, as described 
in [Maeda & Bershad 1993b]. Finally, using Active
Messages limits Fast Sockets communication to the 
local-area domain. Fast Sockets supports wide-area 
communication by automatically switching to stan- 
dard network protocols for non-local addresses. It is 
a reasonable trade-off, as endpoint processing over- 
heads are generally not the limiting factor for internet- 
work communication. 
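
The switch between the fast path and standard protocols could be driven by a test on the peer address. The text does not say how Fast Sockets classifies an address as non-local, so the subnet comparison below is purely an assumption for illustration, and choose_path and path_t are hypothetical names. A complete check would also have to detect peers that are not Fast Sockets programs.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>

typedef enum { PATH_ERROR = -1, PATH_STANDARD = 0, PATH_FAST = 1 } path_t;

/* Assumed rule: a peer on the same IPv4 subnet as the local host is
   treated as local-area; anything else falls back to standard TCP/IP. */
static path_t choose_path(const char *local, const char *peer,
                          const char *netmask) {
    struct in_addr l, p, m;

    if (inet_pton(AF_INET, local, &l) != 1 ||
        inet_pton(AF_INET, peer, &p) != 1 ||
        inet_pton(AF_INET, netmask, &m) != 1)
        return PATH_ERROR;

    /* All three values are in network byte order, so a bitwise
       comparison of the masked addresses is order-independent. */
    return ((l.s_addr & m.s_addr) == (p.s_addr & m.s_addr))
               ? PATH_FAST
               : PATH_STANDARD;
}
```

Making the decision once, at connection time, keeps the per-packet path free of any local-versus-remote test.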


Active Messages does have a number of benefits, 
however. The handler abstraction is extremely use- 
ful. A handler executes upon message reception at 
the destination, analogous to a network interrupt. At 
this time, the protocol can store packet data for later 
use, pass the packet data to the user program, or deal 
with exceptional conditions. Message handlers allow 
for a wide variety of control operations within a pro- 
tocol without slowing down the critical path of data 
transmission and reception. Handler arguments en- 
able the easy separation of packet data and metadata: 
packet data (that is, application data) is carried as a 
bulk transfer argument, and packet metadata (proto- 
col headers) are carried in the remaining word-sized 
arguments. This is only possible if the headers can 
fit into the number of argument words provided; for 
our local-area protocol, the 8 words supplied by Ac- 
tive Messages is sufficient. 

Fast Sockets further optimizes for the local area by 
omitting features of TCP/IP unnecessary in that envi- 
ronment. For example, Fast Sockets uses the check- 
sum or CRC of the network hardware instead of one 
in the packet header; software checksums make little 
sense when packets are only traversing a single net-
work. Fast Sockets has no equivalent of IP’s ‘Time-
To-Live’ field, or IP’s internetwork routing support.
Since maximum packet sizes will not change within a 
local area network, Fast Sockets does not support IP- 
style in-flight packet fragmentation. 


3.2 Collapsing Layers 


To avoid the performance and structuring problems 
created by the multi-layered implementation of Unix 
TCP/IP, Fast Sockets collapses the API and protocol 
layers of the communications stack together. This 
avoids abstraction conflicts between the programming 
interface and the protocol and reduces the number 
of conversions between layer abstractions, further re- 
ducing processing overheads. 

The actual network device interface, Active Mes- 
sages, remains a distinct layer from Fast Sockets. 
This facilitates the portability of Fast Sockets between 
different operating systems and network hardware. 
Active Messages implementations are available for 
the Intel Paragon [Liu & Culler 1995], FDDI [Mar- 
tin 1994], Myrinet [Mainwaring & Culler 1995], and 
ATM [von Eicken et al. 1995]. Layering costs are 
kept low because Active Messages is a thin layer, and 
all of its implementation-dependent constants (such as
maximum packet size) are exposed to higher layers. 

The Fast Sockets layer stays lightweight by exploit- 
ing Active Message handlers. Handlers allow rarely- 
used functionality, such as connection establishment, 
to be implemented without affecting the critical path 
of data transmission and reception. There is no need 
to test for uncommon events when a packet arrives — 
this is encoded directly in the handler. Unusual data 
transmission events, such as out-of-band data, also 
use their own handlers to keep normal transfer costs 
low. 
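
One way to picture this is a dispatch table keyed by a handler index that the sender places in each message. The sketch below assumes such an index exists; the index values and handler names are hypothetical and are not taken from the Fast Sockets source.

```python
# Hypothetical handler indices; the paper does not specify the real numbering.
H_DATA, H_CONNECT, H_OOB = 0, 1, 2

class Endpoint:
    def __init__(self):
        self.buffer = bytearray()   # normal in-order data
        self.oob = bytearray()      # out-of-band data
        self.connected = False
        # The sender selects the handler, so the common data path contains
        # no per-packet test for rare events.
        self.handlers = {
            H_DATA: self.on_data,
            H_CONNECT: self.on_connect,
            H_OOB: self.on_oob,
        }

    def on_data(self, data):
        self.buffer += data

    def on_connect(self, data):
        self.connected = True

    def on_oob(self, data):
        self.oob += data

    def deliver(self, handler_index, data=b""):
        """Invoke the handler named by the incoming message."""
        self.handlers[handler_index](data)

ep = Endpoint()
ep.deliver(H_CONNECT)
ep.deliver(H_DATA, b"abc")
ep.deliver(H_OOB, b"!")
```

Here `on_data` stays a single append; connection setup and out-of-band handling add code but no branches to it.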

Reducing the number of layers and exploiting Active
Message handlers lower the protocol- and API-specific
costs of communication. While collapsing
layers means that every protocol layer-API combination
has to be written anew, there are relatively few such
combinations, and the number of distinct operations
required for each API is small.


3.3 Simple Buffer Management


Fast Sockets avoids the complexities of mbuf-style 
memory management by using a single, contiguous 
virtual memory buffer for each socket. Data is transferred
directly into this buffer via Active Message
data transfer messages. The message handler places 
data sequentially into the buffer to maintain in-order
delivery and make data transfer to a user buffer a simple
memory copy. The argument words of the data
transfer messages carry packet metadata; because the
argument words are passed separately to the handler,
there is no need for the memory management system
to strip off packet headers.


Figure 1: Data transfer in Fast Sockets. A send()
call transmits the data directly from the user buffer
into the network. When it arrives at the remote destination,
the message handler places it into the socket
buffer, and a subsequent recv() call copies it into
the user buffer.
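
The receive-side buffering described in this section can be sketched as follows. This is a minimal model, not the Fast Sockets implementation: the class name, the fixed capacity, and the choice to raise an error when the buffer fills (rather than apply flow control or wrap around) are all assumptions.

```python
class SocketBuffer:
    """Single contiguous per-socket receive buffer (sketch, fixed capacity)."""

    def __init__(self, capacity=64 * 1024):
        self.buf = bytearray(capacity)
        self.head = 0   # next byte to hand to the application
        self.tail = 0   # next free byte for arriving data

    def handler_append(self, data):
        """Called by the message handler: place data sequentially."""
        n = len(data)
        if self.tail + n > len(self.buf):
            raise BufferError("socket buffer full")  # flow control is out of scope
        self.buf[self.tail:self.tail + n] = data
        self.tail += n

    def recv(self, nbytes):
        """A recv() is a plain memory copy out of the buffer."""
        n = min(nbytes, self.tail - self.head)
        out = bytes(self.buf[self.head:self.head + n])
        self.head += n
        if self.head == self.tail:      # buffer drained: reuse from the start
            self.head = self.tail = 0
        return out

sb = SocketBuffer(capacity=16)
sb.handler_append(b"abcd")
sb.handler_append(b"efgh")
first = sb.recv(6)
second = sb.recv(100)
```

Sequential placement is what keeps delivery in order: bytes leave through `recv()` in exactly the order the handler stored them.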


Fast Sockets eliminates send buffering. Because 
many user applications rely heavily on small pack- 
ets and on request-response behavior, delaying packet 
transmission only serves to increase user-visible la- 
tency. Eliminating send-side buffering reduces proto- 
col overhead because there are no copies on the send 
side of the protocol path — Active Messages already 
provides reliability. 


Figure 1 shows Fast Sockets’ send mechanism and 
buffering techniques. 


A possible problem with this approach is that hav- 
ing many Fast Sockets consumes considerably more 
memory than the global mbuf pool used in traditional 
kernel implementations. This is not a major concern, 
for two reasons. First, the memory capacity of current 
workstations is very large; the scarce physical memory
situations that the traditional mechanisms are designed
for are generally not a problem. Second, the
socket buffers are located in pageable virtual memory;
if memory and scheduling pressures are severe
enough, the buffer can be paged out. Although pag- 
ing out the buffer will lead to worse performance, we 
expect that this is an extremely rare occurrence. 


A more serious problem with placing socket buffers 
in user virtual memory is that it becomes extremely 
difficult to share the socket buffer between processes. 
Such sharing can arise due to a fork() call, for instance.
Currently, Fast Sockets cannot be shared between
processes (see section 3.7).

Figure 2: Data transfer via receive posting. If a
recv() call is issued prior to the arrival of the desired
data, the message handler can directly route the
data into the user buffer, bypassing the socket buffer.


3.4 Copy Avoidance 


The standard Fast Sockets data transfer mechanism 
involves two copies along the receive path, from the 
network interface to the socket buffer, and then from 
the socket buffer to the user buffer specified in a 
recv() call. While the copy from the network in- 
terface cannot be avoided, the second copy increases 
the processing overhead of a data packet, and con- 
sequently, round-trip latencies. A high-performance 
communications layer should bypass this second copy 
whenever possible. 

It is possible, under certain circumstances, to avoid 
the copy through the socket buffer. If the data’s final 
memory destination is already known upon packet arrival
(through a recv() call), the data can be directly
copied there. We call this technique receive posting.
Figure 2 shows how receive posting operates in Fast
Sockets. If the message handler determines that an
incoming packet will satisfy an outstanding recv()
call, the packet’s contents are received directly into 
the user buffer. The socket buffer is never touched. 

Receive posting in Fast Sockets is possible because 
of the integration of the API and the protocol, and the 
handler facilities of Active Messages. Protocol-API 
integration allows knowledge of the user’s destina- 
tion buffer to be passed down to the packet process- 
ing code. Using Active Message handlers means that 
the Fast Sockets code can decide where to place the 
incoming data when it arrives. 
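
A minimal sketch of receive posting, under simplifying assumptions: one outstanding recv() at a time, and direct placement only when no earlier data is waiting in the socket buffer (so byte order is preserved). The class and method names are hypothetical.

```python
class FastSocketSketch:
    def __init__(self):
        self.socket_buffer = bytearray()  # staging area for unposted data
        self.posted = None                # user buffer from an outstanding recv()
        self.posted_filled = 0

    def post_recv(self, user_buffer):
        """recv() issued before the data arrives: remember the destination."""
        self.posted = memoryview(user_buffer)
        self.posted_filled = 0

    def on_packet(self, data):
        """Message handler: choose the destination at packet arrival time."""
        if self.posted is not None and not self.socket_buffer:
            room = len(self.posted) - self.posted_filled
            n = min(len(data), room)
            # Copy straight into the user buffer, bypassing the socket buffer.
            self.posted[self.posted_filled:self.posted_filled + n] = data[:n]
            self.posted_filled += n
            data = data[n:]
            if self.posted_filled == len(self.posted):
                self.posted = None        # the posted recv() is satisfied
        self.socket_buffer += data         # leftover or unposted data

s = FastSocketSketch()
user = bytearray(4)
s.post_recv(user)
s.on_packet(b"abcdef")   # 4 bytes go to the user buffer, 2 spill over
s.on_packet(b"gh")       # nothing posted any more: buffered as usual
```

The decision is made inside the handler, which is only possible because the API layer (which knows about the posted buffer) and the protocol layer (which sees the packet) share state.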


3.5 Design Issues In Receive Posting 


Many high-performance communications systems are 
now using sender-based memory management, where 
the sending node determines the data’s final memory
destination. Examples of communications layers
with this memory management style are Hamlyn
[Buzzard et al. 1996] and the SHRIMP network interface
[Blumrich et al. 1994]. Also, the initial version
of Generic Active Messages [Culler et al. 1994]
offered only sender-based memory management for 
data transfer. 

Sender-based memory management has a major 
drawback for use in byte-stream and message-passing 
APIs such as Sockets. With sender-based manage- 
ment, the sending and receiving endpoints must syn- 
chronize and agree on the destination of each packet. 
This synchronization usually takes the form of a mes- 
sage exchange, which imposes a time cost and a 
resource cost (the message exchange uses network 
bandwidth). To minimize this synchronization cost, 
the original version of Fast Sockets used the socket 
buffer as a default destination. This meant that when
a recv() call was made, data already in flight had
to be directed through the socket buffer, as shown in
Figure 1. These synchronization costs lowered Fast
Sockets’ throughput relative to Active Messages’ on 
systems where the network was not a limiting factor, 
such as the Myrinet network. 

Generic Active Messages 1.5 introduced a new data
transfer message type that did not require sender- 
based memory management. This message type, the 
medium message, transferred data into an anonymous 
region of user memory and then invoked the handler 
with a pointer to the packet data. The handler was 
responsible for long-term data storage, as the buffer 
was deallocated upon handler completion. Keeping 
memory management on the receiver improves Fast 
Sockets performance considerably. Synchronization 
is now only required between the API and the pro- 
tocol layers, which is simple due to their integration. 
The receive handler now determines the memory destination
of incoming data at packet arrival time, enabling
a recv() of in-flight data to benefit from receive
posting and bypass the socket buffer. The net
result is that receiver-based memory management did 
not significantly affect Fast Sockets’ round-trip laten- 
cies and improved large-packet throughput substan- 
tially, to within 10% of the throughput of raw Active 
Messages. 

The use of receiver-based memory management 
has some trade-offs relative to sender-based systems. 
In a true zero-copy transport layer, where packet data
is transferred via DMA to user memory, transfer to 
an anonymous region can place an extra copy into the 
receive path. This is not a problem for two reasons. 
First, many current I/O architectures, like that on the 
SPARC, are limited in the memory regions that they 
can perform DMA operations to. Second, DMA oper- 
ations usually require memory pages to be pinned in 
physical memory, and pinning an arbitrary page can
be an expensive operation. For these reasons, current
versions of Active Messages move data from the
network interface into an anonymous staging area before
either invoking the message handler (for medium
messages) or placing the data in its final destination
(for standard data transfer messages). Consequently,
receiver-based memory management does not impose
a large cost in our current system. For a true zero-copy
system, it should be possible for a handler to be responsible
for moving data from the network interface
card to user memory.

Based on our experiences implementing Fast Sock- 
ets with both sender- and receiver-based memory 
management schemes, we believe that, for messaging 
layers such as Sockets, the performance increase delivered
by receiver-based management schemes outweighs
the implementation costs.


3.6 Other Fast Socket Operations 


Fast Sockets supports the full sockets API, including
socket creation, name management, and connection
establishment. The socket() call for the
AF_INET address family and the “default protocol”
creates a Fast Socket; this allows programs to explicitly
request TCP or UDP sockets in a standard way.
Fast Sockets utilizes existing name management fa- 
cilities. Every Fast Socket has a shadow socket as- 
sociated with it; this shadow socket is of the same 
type and shares the same file descriptor as the Fast 
Socket. Whenever a Fast Socket requests to bind()
to an AF_INET address, the operation is first per- 
formed on the shadow socket to determine if the oper- 
ation is legal and the name is available. If the shadow 
socket bind succeeds, the AF_INET name is bound 
to the Fast Socket’s Active Message endpoint name 
via an external name service. Other operations, such 
as setsockopt() and getsockopt(), work in
a similar fashion. 

Shadow sockets are also used for connection establishment.
When a connect() call is made by
the user application, Fast Sockets determines if the
name is in the local subnet. For local addresses,
the shadow socket performs a connect() to a
port number that is determined through a hash function.
If this connect() succeeds, then a handshake
is performed to bootstrap the connection. Should
the connect() to the hashed port number fail, or
the connection bootstrap process fail, then a normal
connect() call is performed. This mechanism allows
a Fast Sockets program to connect to a non-Fast
Sockets program without difficulties.
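
The fallback logic in this paragraph can be sketched as below. The hash in `fast_port_for` is purely hypothetical (the paper does not give the hash function), and `tcp_connect`, `handshake`, and `is_local` are caller-supplied stand-ins for the shadow-socket connect(), the bootstrap handshake, and the subnet test.

```python
def fast_port_for(user_port):
    """Hypothetical hash from the user-visible port to the Fast Sockets port."""
    return 1024 + (user_port * 2654435761) % (65536 - 1024)

def connect(addr, port, is_local, tcp_connect, handshake):
    """Return which transport the connection ended up using.

    tcp_connect(addr, port) -> bool and handshake(addr) -> bool are
    stand-ins supplied by the caller; they are not real library calls.
    """
    if is_local(addr):
        if tcp_connect(addr, fast_port_for(port)) and handshake(addr):
            return "fast-sockets"
    # Non-local address, or the peer is not a Fast Sockets program.
    if tcp_connect(addr, port):
        return "tcp"
    raise ConnectionRefusedError(f"cannot connect to {addr}:{port}")

def is_local(addr):
    return addr.startswith("10.0.0.")

# A peer that listens only on the ordinary port (a non-Fast Sockets program).
plain_peer = {("10.0.0.2", 21)}
plain_result = connect("10.0.0.2", 21, is_local,
                       lambda a, p: (a, p) in plain_peer,
                       lambda a: True)

# A peer that also listens on the hashed port (a Fast Sockets program).
fast_peer = plain_peer | {("10.0.0.2", fast_port_for(21))}
fast_result = connect("10.0.0.2", 21, is_local,
                      lambda a, p: (a, p) in fast_peer,
                      lambda a: True)
```

The caller sees one `connect()` either way; only the returned transport differs.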

An accept() call becomes more complicated as
a result of this scheme, however. Because both Fast
Sockets and non-Fast Sockets programs can connect
to a socket that has performed a listen() call,
there are two distinct port numbers for a given socket.
The port supplied by the user accepts connection requests
from programs using normal protocols. The
second port is derived from hashing on the user port
number, and is used to accept connection requests
from Fast Sockets programs. An accept() call
multiplexes connection requests from both ports.

The connection establishment mechanism has
some trade-offs. It utilizes existing name and 
connection management facilities, minimizing the 
amount of code in Fast Sockets. Using TCP/IP to 
bootstrap the connection can impose a high time cost, 
which limits the realizable throughput of short-lived 
connections. Using two port numbers also introduces 
the potential problem of conflicts: the Fast Sockets- 
generated port number could conflict with a user port. 
We do not expect this to realistically be a problem. 


3.7 Fast Sockets Limitations


Fast Sockets is a user-level library. This limits its full 
compatibility with the Sockets abstraction. First, applications
must be relinked to use Fast Sockets, although
no code changes are required. More seriously,
Fast Sockets cannot currently be shared between two
processes (for example, via a fork() call), and all
Fast Sockets state is lost upon an exec() or exit()
call. This poses problems for traditional Internet
server daemons and for “super-server” daemons such
as inetd, which depend on a fork() for each
incoming request. User-level operation also causes
problems for socket termination; standard TCP/IP
sockets are gracefully shut down on process termination.
These problems are not insurmountable. Sharing
Fast Sockets requires an Active Messages layer
that allows endpoint sharing and either the use of a
dedicated server process [Maeda & Bershad 1993b]
or the use of shared memory for every Fast Socket’s
state. Recovering Fast Sockets state lost during an
exec() call can be done via a dedicated server process,
where the Fast Socket’s state is migrated to and
from the server before and after the exec(), similar
to the method used by the user-level Transport
Layer Interface [Stevens 1990].

The Fast Sockets library is currently single- 
threaded. This is problematic for current versions of 
Active Messages because an application must explic- 
itly touch the network to receive messages. Since a 
user application could engage in an arbitrarily long 
computation, it is difficult to implement operations 
such as asynchronous I/O. While multi-threading 
offers one solution, it makes the library less portable,
and imposes synchronization costs.


4 Performance 


This section presents our performance measurements 
of Fast Sockets, both for microbenchmarks and a few 
applications. The microbenchmarks assess the suc- 
cess of our efforts to minimize overhead and round- 
trip latency. We report round-trip times and sustained 
bandwidth available from Fast Sockets, using both 
our own microbenchmarks and two publicly available 
benchmark programs. The true test of Fast Sockets’ 
usefulness, however, is how well its raw performance 
is exposed to applications. We present results from an 
FTP application to demonstrate the usefulness of Fast 
Sockets in a real-world environment. 


4.1 Experimental Setup 


Fast Sockets has been implemented using Generic Active
Messages (GAM) [Culler et al. 1994] on both
HP/UX 9.0.x and Solaris 2.5. The HP/UX platform
consists of two 99 MHz HP735’s interconnected by an
FDDI network and using the Medusa network adapter
[Banks & Prudence 1993]. The Solaris platform is a
collection of UltraSPARC 1’s connected via a Myrinet
network [Seitz 1994]. For all tests, there was no
other load on the network links or switches. 

Our microbenchmarks were run against a variety of 
TCP/IP setups. The standard HP/UX TCP/IP stack is 
well-tuned, but there is also a single-copy stack de- 
signed for use on the Medusa network interface. We 
ran our tests on both stacks. While the Solaris TCP/IP
stack has reasonably good performance, the Myrinet 
TCP/IP drivers do not. Consequently, we also ran our 
microbenchmarks on a 100-Mbit Ethernet, which has 
an extremely well-tuned driver. 

We used Generic Active Messages 1.0 on the 
HP/UX platform as a base for Fast Sockets and as our 
Active Messages layer. The Solaris tests used Generic 
Active Messages 1.5, which adds support for medium 
messages (receiver-based memory management) and
client-server program images. 


4.2 Microbenchmarks


4.2.1 Round-Trip Latency


Our round-trip microbenchmark is a simple ping- 
pong test between two machines for a given transfer 
size. The ping-pong is repeated until a 95% confidence
interval is obtained. TCP/IP and Fast Sockets
use the same program; Active Messages uses different
code but the same algorithm. We used the
TCP/IP TCP_NODELAY option to force packets to be
transmitted as soon as possible (instead of the default
behavior, which attempts to batch small packets
together); this reduces throughput for small transfers,
but yields better round-trip times. The socket
buffer size was set to 64 Kbytes¹, as this also improves
TCP/IP round-trip latency. We tested TCP/IP and Fast
Sockets for transfers up to 64 Kbytes; Active Messages
only supports 4 Kbyte messages.


[Table 1 rows: Fast Sockets; Active Messages; TCP/IP (Myrinet); TCP/IP (Fast Ethernet). Column group: Small Packets.]

Table 1: Least-squares analysis of the Solaris round-trip
microbenchmark. The per-byte and estimated
startup ($t_0$) costs are for round-trip latency, and are
measured in microseconds. The actual startup costs
(for a single-byte message) are also shown. Actual
and estimated costs differ because round-trip latency
is not strictly linear. Per-byte costs for Fast
Sockets are lower than for Active Messages because
Fast Sockets benefits from packet pipelining; the Active
Messages test only sends a single packet at a
time. “Small Packets” examines protocol behavior for
packets smaller than 1K; here, Fast Sockets and Active
Messages do considerably better than TCP/IP.
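
For readers who want to reproduce the TCP/IP side of this configuration, the socket options mentioned above can be set through the standard sockets API. The helper name `make_benchmark_socket` is ours; the option constants are the standard ones exposed by Python's `socket` module. The operating system may round or cap the requested buffer size, so the sketch checks only the TCP_NODELAY setting.

```python
import socket

def make_benchmark_socket(bufsize=64 * 1024):
    """Create a TCP socket configured like the round-trip test (sketch)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Disable the Nagle algorithm so small writes are sent immediately.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Request 64 Kbyte socket buffers; the OS may adjust the value.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bufsize)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bufsize)
    return s

sock = make_benchmark_socket()
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
```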

Figures 3 and 4 present the results of the round- 
trip microbenchmark. Fast Sockets achieves low la- 
tency round-trip times, especially for small packets. 
Round-trip time scales linearly with increasing trans- 
fer size, reflecting the time spent in moving the data 
to the network card. There is a “hiccup” at 4096 bytes 
on the Solaris platform, which is the maximum packet 
size for Active Messages. This occurs because Fast 
Sockets’ fragmentation algorithm attempts to balance 
packet sizes for better round-trip and bandwidth characteristics.
Fast Sockets’ overhead is low, staying relatively
constant at 25-30 microseconds (over that of
Active Messages).

¹ TCP/IP is limited to a maximum buffer size of 56 Kbytes.

Figure 3: Round-trip performance for Fast Sockets, Active Messages, and TCP for the HP/UX/Medusa platform.
Active Messages for HP/UX cannot transfer more than 8140 bytes at a time.

Figure 4: Round-trip performance for Fast Sockets, Active Messages, and TCP on the Solaris/Myrinet platform.
Active Messages for Myrinet can only transfer up to 4096 bytes at a time.

Table 1 shows the results of a least-squares linear
regression analysis of the Solaris round-trip microbenchmark.
We show $t_0$, the estimated cost for a
0-byte packet, and the marginal cost for each additional
data byte. Surprisingly, Fast Sockets’ cost-per-byte
appears to be lower than that of Active Messages.
This is because the Fast Sockets per-byte cost is re- 
ported for a 64K range of transfers while Active Mes- 
sages’ per-byte cost is for a 4K range. The Active 
Messages test minimizes overhead by not implement- 
ing in-order delivery, which means only one packet 
can be outstanding at a time. Both Fast Sockets and 
TCP/IP provide in-order delivery, which enables data 
packets to be pipelined through the network and thus 
achieve a lower per-byte cost. The per-byte costs of 
Fast Sockets for small packets (less than 1 Kbyte) are 
slightly higher than Active Messages. While TCP’s 
long-term per-byte cost is only about 15% higher than 
that of Fast Sockets, its performance for small packets 
is much worse, with per-byte costs twice that of Fast 
Sockets and startup costs 5-10 times higher. 

Another surprising element of the analysis is that
the overall $t_0$ and the $t_0$ for small packets are very
different, especially for Fast Sockets and Myrinet TCP/IP.
Both protocols pipeline packets, which lowers round-trip
latencies for multi-packet transfers. This causes
a non-linear round-trip latency function, yielding different
estimates of $t_0$ for single-packet and multi-packet
transfers.
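
The regression behind these numbers models round-trip time as a startup cost plus a per-byte cost. A minimal sketch of such a fit follows, using ordinary least squares written out by hand. The timings are synthetic values generated from an exact line, chosen only to show the mechanics; they are not measurements from the paper.

```python
def fit_line(sizes, times):
    """Ordinary least squares for times ~= t0 + per_byte * size."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    per_byte = sxy / sxx
    t0 = mean_y - per_byte * mean_x
    return t0, per_byte

# Synthetic, made-up timings in microseconds; not data from the paper.
sizes = [1, 64, 256, 1024, 4096]
times = [60.0 + 0.05 * s for s in sizes]
t0, per_byte = fit_line(sizes, times)
```

With real measurements the latency curve is not a straight line, so fitting only the small-packet sizes and fitting the full range would give different intercepts, which is the effect the paragraph above describes.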


4.2.2 Bandwidth


Our bandwidth microbenchmark does 500 send()
calls of a given size, and then waits for a response. 
This is repeated until a 95% confidence interval is ob- 
tained. As with the round-trip microbenchmark, the 
TCP/IP and Fast Sockets measurements were derived 
from the same program and Active Messages results 
were obtained using the same algorithm, but different 
code. Again, we used the TCP/IP TCP_NODELAY option
to force immediate packet transmission, and a 64
Kbyte socket buffer. 

Results for the bandwidth microbenchmark are 
shown in Figures 5 and 6. Fast Sockets is able to re- 
alize most of the available bandwidth of the network. 
On the UltraSPARC, the SBus is the limiting factor, 
rather than the network, with a maximum throughput 
of about 45 MB/s. Of this, Active Messages exposes 
35 MB/s to user applications. Fast Sockets can realize 
about 90% of the Active Messages bandwidth, losing 
the rest to memory movement. Myrinet’s TCP/IP only
realizes 90% of the Fast Sockets bandwidth, limited 
by its high processing overhead. 

[Table 2 columns: $r_\infty$ (MB/s), $n_{1/2}$ (bytes). Surviving entries: Fast Sockets, $r_\infty$ = 32.9; Active Messages, $r_\infty$ = 39.2. Rows without surviving values: TCP/IP (Myrinet); TCP/IP (Fast Ethernet).]

Table 2: Least-squares regression analysis of the Solaris
bandwidth microbenchmark. $r_\infty$ is the maximum
bandwidth of the network and is measured in
megabytes per second. The half-power point ($n_{1/2}$)
is the packet size that delivers half of the maximum
throughput, and is reported in bytes. TCP/IP was run
with the TCP_NODELAY option, which attempts to
transmit packets as soon as possible rather than coalescing
data together.


Table 2 presents the results of a least-squares fit of
the bandwidth curves to the equation

$$ y = \frac{r_\infty \, x}{n_{1/2} + x} $$

which describes an idealized bandwidth curve. $r_\infty$
is the theoretical maximum bandwidth realizable by
the communications layer, and $n_{1/2}$ is the half-power
point of the curve. The half-power point is the trans- 
fer size at which the communications layer can real- 
ize half of the maximum bandwidth. A lower half- 
power point means a higher percentage of packets that 
can take advantage of the network’s bandwidth. This 
is especially important given the frequency of small 
messages in network traffic. Fast Sockets’ half-power 
point is only 18% larger than that of Active Messages, 
at 441 bytes. Myrinet TCP/IP realizes a maximum 
bandwidth 10% less than Fast Sockets but has a half- 
power point four times larger. Consequently, even 
though both protocols can realize much of the avail- 
able network bandwidth, TCP/IP needs much larger 
packets to do so, reducing its usefulness for many ap- 
plications. 
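
To make the half-power point concrete, the sketch below evaluates the idealized bandwidth curve and recovers its two parameters from sample points. The fit uses the linearization 1/y = 1/r_inf + (n_half / r_inf)(1/x), which is our choice; the paper does not say which fitting procedure was used. The sample curve is synthetic, generated from the Fast Sockets values quoted in the text (32.9 MB/s and 441 bytes).

```python
def bandwidth(x, r_inf, n_half):
    """Idealized bandwidth at transfer size x."""
    return r_inf * x / (n_half + x)

def fit_bandwidth(sizes, rates):
    """Estimate (r_inf, n_half) by a straight-line fit of 1/rate against 1/size."""
    u = [1.0 / x for x in sizes]
    v = [1.0 / y for y in rates]
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    slope = (sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
             / sum((a - mean_u) ** 2 for a in u))
    intercept = mean_v - slope * mean_u
    r_inf = 1.0 / intercept          # intercept is 1 / r_inf
    return r_inf, slope * r_inf      # slope is n_half / r_inf

sizes = [64, 256, 1024, 4096, 16384]
rates = [bandwidth(x, 32.9, 441.0) for x in sizes]   # synthetic curve
r_inf, n_half = fit_bandwidth(sizes, rates)
half = bandwidth(n_half, r_inf, n_half)              # bandwidth at the half-power point
```

Evaluating the curve at `n_half` gives exactly half of `r_inf`, which is the definition used in the text.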


4.2.3 netperf and ttcp 


Two commonly used microbenchmarks for evaluat- 
ing network software are netperf and ttcp. Both 
of these benchmarks are primarily designed to test 
throughput, although netperf also includes a test 
of request-response throughput, measured in transac- 
tions/second. 

We used version 1.2 of ttcp, modified to work
under Solaris, and netperf 2.11 for testing. The 
throughput results are shown in Figure 7, and our 
analysis is in Table 3. 

A curious result is that the half-power points for 
ttcp and netperf are substantially lower for 
TCP/IP than on our bandwidth microbenchmark. One 
reason for this is that the maximum throughput for
TCP/IP is only about 50-60% that of Fast Sockets.
Another reason is that TCP defaults to batching small
data writes in order to maximize throughput (the Nagle
algorithm) [Nagle 1984], and these tests do not
disable the algorithm (unlike our microbenchmarks);
however, this behavior gains small-packet bandwidth
at the cost of higher round-trip latencies, as data is held at
the sender in an attempt to coalesce data.


Figure 5: Observed bandwidth for Fast Sockets, TCP, and Active Messages on HP/UX with the Medusa FDDI
network interface. The memory copy bandwidth of the HP735 is greater than the FDDI network bandwidth, so
Active Messages and Fast Sockets can both realize close to the full bandwidth of the network.

Figure 6: Observed bandwidth for Fast Sockets, TCP, and Active Messages on Solaris using the Myrinet local-area
network. The bus limits the maximum throughput to 45MB/s. Fast Sockets is able to realize much of the available
bandwidth of the network because of receive posting.

Figure 7: Ttcp and netperf bandwidth measurements on Solaris for Fast Sockets and Myrinet TCP/IP.

[Table 3 values; row labels did not survive. Columns, as in Table 2: $r_\infty$ (MB/s), $n_{1/2}$ (bytes).]
38.57    560.7
19.59    687.5
41.57    785.7
23.72    1189

Table 3: Least-squares regression analysis of the
ttcp and netperf microbenchmarks. These tests
were run on the Solaris 2.5.1/Myrinet platform.
TCP/IP half-power point measures are lower than in
Table 2 because both ttcp and netperf attempt
to improve small-packet bandwidth at the price of
small-packet latency.


The netperf microbenchmark also has a 
“request-response” test, which reports the number 
of transactions per second a communications stack 
is capable of for a given request and response size. 
There are two permutations of this test, one using 
an existing connection and one that establishes a 
connection every time; the latter closely mimics the
behavior of protocols such as HTTP. The results of
these tests are reported in transactions per second and 
shown in Figure 8. 

The request-response test shows Fast Sockets’ 
Achilles heel: its connection mechanism. While Fast 
Sockets does better than TCP/IP for request-response 
behavior with a constant connection (Figure 8(a)), in- 
troducing connection startup costs (Figure 8(b)) re- 
duces or eliminates this advantage dramatically. This 
points out the need for an efficient, high-speed con- 
nection mechanism. 


4.3 Applications 


Microbenchmarks are useful for evaluating the raw 
performance characteristics of a communications im- 
plementation, but raw performance does not express 
the utility of a communications layer. Instead, it is 
important to characterize the difficulty of integrating 
the communications layer with existing applications, 
and the performance improvements realized by those 
applications. This section examines how well Fast 
Sockets supports the real-life demands of a network 
application. 


4.3.1 File Transfer 


File transfer is traditionally considered a bandwidth- 
intensive application. However, the FTP protocol that 
is commonly used for file transfer still has a request- 
response nature. Further, we wanted to see what im- 
provements in performance, if any, would be realized 
by using Fast Sockets for an application it was not intended
for.

Figure 8: Netperf measurements of request-response transactions-per-second on the Solaris platform, for a variety
of packet sizes. Connection costs significantly lower Fast Sockets’ advantage relative to Myrinet TCP/IP, and
render it slower than Fast Ethernet TCP/IP. This is because the current version of Fast Sockets uses the TCP/IP
connection establishment mechanism.


We used the NcFTP ftp client (ncftp), version
2.3.0, and the Washington University ftp server
(wu-ftpd), version 2.4.2. Because Fast Sockets 
currently does not support fork(), we modified 
wu-ftpd to wait for and accept incoming connec- 
tions rather than be started from the inetd Internet 
server daemon. 

Our FTP test involved transferring a number of
ASCII files of various sizes and recording the elapsed
time and realized bandwidth as reported by the FTP
client. On both machines, files were stored in an in-memory
filesystem, to avoid the bandwidth limitations
imposed by the disk.

The relative throughput for the Fast Sockets and 
TCP/IP versions of the FTP software is shown in Fig- 
ure 9. Surprisingly, Fast Sockets and TCP/IP have 
roughly comparable performance for small files (1 
byte to 4K bytes). This is due to the expense of con- 
nection setup — every FTP transfer involves the cre- 
ation and destruction of a data connection. For mid- 
sized transfers, between 4 Kbytes and 2 Mbytes, Fast 
Sockets obtains considerably better bandwidth than 
normal TCP/IP. For extremely large transfers, both 
TCP/IP and Fast Sockets can realize a significant frac- 
tion of the network’s bandwidth. 


5 Related Work 


Improving communications performance has long 
been a popular research topic. Previous work has focused
on protocols, protocol and infrastructure implementations,
and the underlying network device software.


The VMTP protocol [Cheriton & Williamson 
1989] attempted to provide a general-purpose proto- 
col optimized for small packets and request-response 
traffic. It performed quite well for the hardware 
it was implemented on, but never became widely 
established; VMTP’s design target was request-response
and bulk transfer traffic, rather than the byte
stream and datagram models provided by the TCP/IP 
protocol suite. In contrast, Fast Sockets provides the 
same models as TCP/IP and maintains application 
compatibility. 

Other work [Clark et al. 1989, Watson & Mamrak 
1987] argued that protocol implementations, rather 
than protocol designs, were to blame for poor performance,
and that efficient implementations of general-purpose
protocols could do as well as or better than
special-purpose protocols for most applications. The 
measurements made in [Kay & Pasquale 1993] lend 
credence to these arguments; they found that mem- 
ory operations and operating system overheads played 
a dominant role in the cost of large packets. For 
small packets, however, protocol costs were significant,
accounting for up to 33% of processing time for
single-byte messages. 


The concept of reducing infrastructure costs was 
explored further in the x-kernel [Hutchinson & Peterson
1991, Peterson 1993], an operating system designed
for high-performance communications. The
original, stand-alone version of the x-kernel per- 
formed significantly better at communication tasks 
than did BSD Unix on the same hardware (Sun 3’s),
using similar implementations of the communications
protocols. Later work [Druschel et al. 1994, Pagels
et al. 1994, Druschel et al. 1993, Druschel & Peterson
1993] focused on hardware design issues relating
to network communication and the use of software
techniques to exploit hardware features. Key
contributions from this work were the concepts of application
device channels (ADC), which provide protected
user-level access to a network device, and fbufs,
which provide a mechanism for rapid transfer of data
from the network subsystem to the user application.
While Active Messages provides the equivalent of an
ADC for Fast Sockets, fbufs are not needed, as receive
posting allows for data transfer directly into the user
application.

Figure 9: Relative throughput realized by Fast Sockets and TCP/IP versions of the FTP file transfer protocol. Connection
setup costs dominate the transfer time for small files and the network transport serves as a limiting factor
for large files. For mid-sized files, Fast Sockets is able to realize much higher bandwidth than TCP/IP.


Recently, the development of a zero-copy TCP 
stack in Solaris [Chu 1996] aggressively utilized hard- 
ware and operating system features such as direct 
memory access (DMA), page re-mapping, and copy- 
on-write pages to improve communications perfor- 
mance. To take full advantage of the zero-copy stack, 
user applications had to use page-aligned buffers and 
transfer sizes larger than a page. Because of these 
limitations, the designers focused on improving re- 
alized throughput, instead of small message latency, 
which Fast Sockets addresses. The resulting system 
achieved 32 MB/s throughput on a similar network 
but with a slower processor. This throughput was 
achieved for large transfers (16K bytes), not the small
packets that make up the majority of network traffic. 
This work required a thorough understanding of the 
Solaris virtual memory subsystem, and changes to the 
operating system kernel; Fast Sockets is an entirely 
user-level solution. 


An alternative to reducing internal operating system
costs is to bypass the operating system altogether,
and use either a user-level library (like Fast Sockets)
or a separate user-level server. Mach 3.0 used the
latter approach [Forin et al. 1991], which yielded poor
networking performance [Maeda & Bershad 1992].
Both [Maeda & Bershad 1993a] and [Thekkath et al.
1993] explored building TCP into a user-level library
linked with existing applications. Both systems, however,
attempted only to match in-kernel performance,
rather than better it. Further, both systems utilized
in-kernel facilities for message transmission, limiting
the possible performance improvement. Edwards and
Muir [Edwards & Muir 1995] attempted to build an
entirely user-level solution, but utilized a TCP stack
that had been built for the HP/UX kernel. Their solution
replicated the organization of the kernel at user
level, with worse performance than the in-kernel TCP
stack.


[Kay & Pasquale 1993] showed that interfacing
to the network card itself was a major cost for
small packets. Recent work has focused on reducing
this portion of the protocol cost, and on utilizing
the message coprocessors that are appearing on
high-performance network controllers such as Myri- 
net [Seitz 1994]. Active Messages [von Eicken et al. 
1992] is the base upon which Fast Sockets is built 
and is discussed above. Illinois Fast Messages [Pakin 
et al. 1995] provided an interface similar to that of 
previous versions of Active Messages, but did not al- 
low processes to share the network. Remote Queues 
[Brewer et al. 1995] provided low-overhead commu- 
nications similar to that of Active Messages, but sep- 
arated the arrival of messages from the invocation of 
handlers. 


The SHRIMP project implemented a stream sockets
layer that uses many of the same techniques as
Fast Sockets [Damianakis et al. 1996]. SHRIMP
supports communication via shared memory and the
execution of handler functions on data arrival. The
SHRIMP network had hardware latencies (4-9 µs
one-way) much lower than the Fast Sockets Myrinet,
but its maximum bandwidth (22 MB/s) was also
lower than that of the Myrinet [Felten et al. 1996].
It used a custom-designed network interface for its
memory-mapped communication model. The interface
provided in-order, reliable delivery, which allowed
for extremely low overheads (7 µs over the
hardware latency); Fast Sockets incurs substantial
overhead to ensure in-order delivery. Realized bandwidth
of SHRIMP sockets was about half the raw capacity
of the network because the SHRIMP communication
model used sender-based memory management,
forcing data transfers to indirect through the
socket buffer. The SHRIMP communication model
also deals poorly with non-word-aligned data, which
required programming complexity to work around;
Fast Sockets transparently handles this unaligned
data without extra data structures or other difficulties
in the data path.


U-Net [von Eicken et al. 1995] is a user-level net- 
work interface developed at Cornell. It virtualized 
the network interface, allowing multiple processes 
to share the interface. U-Net emphasized improving 
the implementation of existing communications pro- 
tocols whereas Fast Sockets uses a new protocol just 
for local-area use. A version of TCP (U-Net TCP) 
was implemented for U-Net; this protocol stack pro- 
vided the full functionality of the standard TCP stack. 
U-Net TCP was modified for better performance; it 
succeeded in delivering the full bandwidth of the un- 
derlying network but still imposed more than 100 mi- 
croseconds of packet processing overhead relative to 
the raw U-Net interface. 


6 Conclusions 


In this paper we have presented Fast Sockets, a com- 
munications interface which provides low-overhead, 
low-latency and high-bandwidth communication on 
local-area networks using the familiar Berkeley Sock- 
ets interface. We discussed how current implementa- 
tions of the TCP/IP suite have a number of problems 
that contribute to poor latencies and mediocre band- 
width on modern high-speed networks, and how Fast 
Sockets was designed to directly address these short- 
comings of TCP/IP implementations. We showed that 
this design delivers performance that is significantly 
better than TCP/IP for small transfers and at least 
equivalent to TCP/IP for large transfers, and that these 
benefits can carry over to real-life programs in every- 
day usage. 


An important contributor to the performance of Fast
Sockets is receive posting, which utilizes socket-layer
information to influence the delivery actions of layers 
farther down the protocol stack. By moving destina- 
tion information down into lower layers of the proto- 
col stack, Fast Sockets bypasses copies that were pre- 
viously unavoidable. 


Receive posting is an effective and useful tool for 
avoiding copies, but its benefits vary greatly depend- 
ing on the data transfer mechanism of the underlying 
transport layer. Sender-based memory management 
schemes impose high synchronization costs on mes- 
saging layers such as Sockets, which can affect real- 
ized throughput. A receiver-based system reduces the 
synchronization costs of receive posting and enables 
high throughput communication without significantly 
affecting round-trip latency. 


In addition to receive posting, Fast Sockets also 
collapses multiple protocol layers together and re- 
duces the complexity of network buffer management. 
The end result of combining these techniques is a sys- 
tem which provides high-performance, low-latency 
communication for existing applications. 


Availability 


Implementations of Fast Sockets are currently 
available for Generic Active Messages 1.0 and 
1.5, and for Active Messages 2.0. The soft- 
ware and current information about the status 
of Fast Sockets can be found on the Web at 
http://now.cs.berkeley.edu/Fastcomm/fastcomm.html.


Abstract


File prefetching is an effective technique for im- 
proving file access performance. In this paper, we 
present a file prefetching mechanism that is based 
on on-line analytic modeling of interesting system 
events and is transparent to higher levels. The mech- 
anism, incorporated into a client’s file cache man- 
ager, seeks to build semantic structures that cap- 
ture the intrinsic correlations between file accesses. 
It then heuristically uses these structures to repre- 
sent distinct file usage patterns and exploits them 
to prefetch files from a file server. We show results 
of a simulation study and of a working implemen- 
tation. Measurements suggest that our method can 
predict future file accesses with an accuracy around 
90%, that it can reduce cache miss rate by up to 47% 
and application latency by up to 40%. Our method 
imposes little overhead, even under antagonistic cir- 
cumstances. 


1 Introduction 


This paper reports the effectiveness of a predictive 
file prefetching technique that operates automati- 
cally and without any sort of information supplied 
by applications or users. The technique, which is 
incorporated into the client’s cache manager, makes 
extra requests to the server, hopefully in advance of 
the actual need for the data. Prefetched data is then 
placed in the client’s cache. 

We hypothesize that there is pronounced regularity
in file access patterns and that relatively simple
algorithms can identify an access pattern and quickly
spot it when it re-emerges later during another run
of the application. In particular, we build semantic
data structures, called access trees, that capture
potentially useful information concerning the
interrelationships and dependencies between files. An access
tree for a program records all the files referenced during
one execution of the program. For each program,
we maintain a number of access trees in virtual memory
that represent distinct file usage patterns. When
a program is re-executed, we compare the access tree
being formed by current activity against saved access
trees to determine which usage pattern, if any, is
recurring; we then prefetch the files remaining in the
saved access tree.

File prefetching brings two major advantages. 
First, applications run faster because they hit more 
in the file cache. Second, there is less “burst” 
load placed on the network because prefetching is 
done only when there is network bandwidth available 
rather than on demand. On the other hand, there 
are two main costs of prefetching. The first is the 
CPU cycles expended by the client in determining 
when and what to prefetch. Cycles are spent both 
on overhead in gathering the information necessary 
to make prefetch decisions, and on actually carry- 
ing out the prefetch. The second cost is the network 
bandwidth and server capacity wasted when prefetch 
decisions inevitably prove less than perfect. 

We have conducted our study in the UNIX envi- 
ronment, which is ubiquitous in academia and the 
research community. It has been well recognized 
that on UNIX most files are accessed in their en- 
tirety and sequentially [19, 2]. We take advantage of 
this phenomenon in two ways. First, we effectively 
model file system events at the level of whole files, 
and look for access patterns across files. Second, if 
a file is predicted, we prefetch the initial portion of 
the file only, letting the standard sequential reada- 
head mechanism bring in the rest when the file is 
demanded. 

Section 2 details the prefetching mechanism. Sec- 
tion 3 reports results from a trace-driven simulation. 
Section 4 describes the implementation and presents 
some initial performance data. Section 5 discusses 
related work. 
Figure 1: Example Access Trees. These graphs
show the access trees generated from the events
described in Section 1.2. Graph (a) shows the version
before compression. Graph (b) shows the version after
both vertical and horizontal compression, assuming E
is a shell program.





2 Mechanism 


2.1 Data Abstraction: Access Tree 


Every program can invoke other programs. In 
UNIX-style operating systems, this is usually real- 
ized by the executing program forking child pro- 
cesses, which in turn execute other programs. The 
programs may also open some data files. All the file 
references in an application can be formulated into 
a tree data structure, dubbed an access tree. Files 
(program files and data files) are the nodes, and an 
edge is drawn from parent A to child B if either 
(1) program A invokes program B or (2) program A 
opens data file B. The order of siblings reflects the 
chronology of file accesses. 

Figure 1(a) depicts the access tree for an applica- 
tion A that includes the following activities: 


1. Program A invokes program B


2. B opens data files C and D, in that order


3. B opens C again


4. A invokes programs E and F


5. F opens D


6. E invokes program G


7. F opens D again
An access tree is subject to two kinds of compression.
Vertical compression draws edges through
UNIX shells, effectively cutting them out of the access
tree. This is necessary because of the role shells
play as command interpreters. Shells are invoked in
a variety of circumstances and can generate a large
number of file usage patterns. Since we perform file
prefetching for each program file in the access tree
and it is infeasible to prefetch accurately for shells,
we choose to ignore them. Horizontal compression
removes consecutive accesses of the same data file.
The detail of consecutive accesses offers no help in
prefetching and can be safely omitted. However,
non-consecutive accesses of the same data file are
preserved for use in a later phase. Assuming program
E is a shell in the previous example, Figure
1(b) shows the access tree after both compressions.
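
The two compressions can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the `Node` class, the `compress` and `show` helpers, and the `shells` parameter are hypothetical names, and the example tree is our reading of the seven events listed above (A invokes B, E and F in that order).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One file in an access tree (hypothetical representation)."""
    name: str
    is_exec: bool = False          # True for program files, False for data files
    children: list = field(default_factory=list)


def compress(node, shells):
    """Return a compressed copy of the access tree rooted at `node`."""
    spliced = []
    for child in node.children:
        child = compress(child, shells)
        if child.is_exec and child.name in shells:
            # Vertical compression: cut the shell out and attach
            # its children directly to the shell's parent.
            spliced.extend(child.children)
        else:
            spliced.append(child)
    merged = []
    for child in spliced:
        # Horizontal compression: drop a data file that immediately
        # repeats the previous data file under the same parent.
        if (merged and not child.is_exec and not merged[-1].is_exec
                and merged[-1].name == child.name):
            continue
        merged.append(child)
    return Node(node.name, node.is_exec, merged)


def show(node):
    """Render a tree as nested text, e.g. A(B(C D) E)."""
    if not node.children:
        return node.name
    return f"{node.name}({' '.join(show(c) for c in node.children)})"


# The example application: B opens C, D, C; E invokes G; F opens D twice.
example = Node("A", True, [
    Node("B", True, [Node("C"), Node("D"), Node("C")]),
    Node("E", True, [Node("G", True)]),
    Node("F", True, [Node("D"), Node("D")]),
])
compressed = compress(example, shells={"E"})
```

With E treated as a shell, `show(compressed)` gives `A(B(C D C) G F(D))`: G is re-attached to A, the repeated D under F collapses, and the non-consecutive accesses of C under B are kept.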

An access tree describes the context in which file 
references are incurred in an application. We feel 
that this context information, if made available to 
the prefetching mechanism, would decrease the ap- 
parent randomness of file system events and permit 
more accurate predictions of future file references. 

As an application proceeds, we construct an ac- 
cess tree for it by intercepting fork, execve, open, 
chdir and exit system calls. Execve and open calls 
deal with references of program files and data files 
respectively. Fork provides information on how processes,
or program executions, are associated with
each other. Information on chdir calls is used to 
resolve relative pathnames of files. The access tree 
is completed upon exit of the program execution. 
An access tree that is being constructed by current 
activity is called a working tree; a finished access 
tree that is saved to exemplify a file usage pattern is 
called a pattern tree. 


2.2 The Prefetch Algorithm 


We use a heuristic function compatibility(w_tree,
p_tree), which returns a value between 0 and
1 indicating the degree to which a working tree w_tree
and a pattern tree p_tree resemble each other. When the
function’s value is above the constant MATCH_THRESHOLD,
we say that the two trees match, meaning that they
are similar enough to be considered to belong
to the same access pattern. We shall explain the
definition of the compatibility function in more detail
after we discuss the basic operation of the mechanism.

As mentioned earlier, a number of pattern trees 
are saved in virtual memory for each program; a 
working tree is constructed in the course of every 
program execution. Whenever a program references 
a file, a new child node is added in the working tree 
for the program, and some analysis is performed to
find out whether any saved pattern trees can be 
prefetched. Our analysis follows a simple guide- 
line: if there is no previously prefetched pattern tree 
or the current working tree no longer matches the 
prefetched pattern tree, compute the compatibility 
of the working tree and each of the pattern trees for 
the program. If any match is found, then prefetch 
the pattern tree with the highest compatibility. If 
more than one pattern tree bears the highest com- 
patibility, then prefetch the one most recently saved. 

The above analysis is carried out for each exe- 
cutable file inside the working tree whenever the ex- 
ecutable initiates a file reference. At this point, a 
pattern tree may have already been prefetched for 
the executable, either as the result of prefetch anal- 
ysis incurred by earlier file accesses or in the form of 
a subtree of the larger tree we prefetched at a higher 
level. The executable can always prefetch another 
pattern tree, based on the analysis result. This effec- 
tively allows minor prefetch corrections to be made, 
reducing the cost of a bad guess. 
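
The selection guideline can be written down compactly. The following is a minimal sketch under stated assumptions, not the authors' implementation: `select_pattern` is a hypothetical name, the compatibility function is passed in as a parameter, and the saved pattern trees are assumed to be ordered from oldest to most recently saved so that ties go to the newest tree.

```python
MATCH_THRESHOLD = 0.4  # value given in the text


def select_pattern(w_tree, pattern_trees, current, compatibility):
    """Return the pattern tree to prefetch next, or None if nothing changes.

    pattern_trees -- saved trees for this program, oldest first (assumption)
    current       -- the pattern tree already prefetched, or None
    compatibility -- function (w_tree, p_tree) -> value in [0, 1]
    """
    # Keep the current choice while the working tree still matches it.
    if current is not None and compatibility(w_tree, current) > MATCH_THRESHOLD:
        return None
    best, best_score = None, MATCH_THRESHOLD
    for p_tree in pattern_trees:
        score = compatibility(w_tree, p_tree)
        # ">=" lets a later (more recently saved) tree win a tie.
        if score > MATCH_THRESHOLD and score >= best_score:
            best, best_score = p_tree, score
    return best
```

Because the check is repeated on every file reference, a poor early choice is replaced as soon as the working tree stops matching it, which is the correction behaviour described above.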

Two complications may arise when the pattern
tree selected for prefetching is large: prefetched files
may be evicted from the cache before they are actually
referenced, and the cache contents are badly
disrupted when the prefetching guess is bad. To diminish
the extent of these problems, we place an
upper limit (PREFETCH_CAPACITY) on the number of
files from a pattern tree we shall prefetch at one
time. When the pattern tree selected is too large, we
prefetch only those files in the initial portion of the
tree so that the prefetch limit is not exceeded. We
also record the immediate child node of the pattern
tree we prefetch last. When the working tree later
extends to this child, we will prefetch the remaining
portion of the pattern tree, if the latest working tree
still matches it. We measure PREFETCH_CAPACITY
in number of files, rather than in number of bytes,
because such a measure goes well with the access
tree structure, where each tree node represents a file.
Further, as illustrated later, only the initial portion
of a predicted file will be prefetched, hence the cost
of an incorrect prediction is proportional to the number
of prefetched files.

On program exit, we compare the newly com- 
pleted working tree with the saved pattern trees. If 
it doesn’t match any of the pattern trees, the work- 
ing tree is saved as new information. Otherwise, it is 
substituted for the pattern tree that it matches best. 

We set the two algorithmic parameters to the fol- 
lowing values: 


• MATCH_THRESHOLD: 0.4


• PREFETCH_CAPACITY: 15


We have found that the behavior of our mechanism
is insensitive to the values of these parameters. We
shall illustrate this later.


Figure 2: Computation of Compatibility. Graph
(a) shows a pattern tree. Graph (b) shows a finished
working tree, bearing a 0.575 compatibility with the
given pattern tree. Graph (c) shows an unfinished
working tree, so far bearing a 0.5 compatibility with
the pattern tree; E is the pivot in the pattern tree,
since it corresponds to the latest child node in the
working tree being formed.


2.3 Compatibility Computation 


We now take a closer look at the compatibility 
function. This function, essentially a similarity met- 
ric between access trees, abstracts out the complex- 
ities of prefetch analysis and pattern tree mainte- 
nance. We illustrate the definition with the example 
in Figure 2. A pattern tree is shown in 2(a). 

Let us first consider the case that the working tree 
is a finished one. We try to pair up the immediate 
child nodes in the working tree with identical child 
nodes in the pattern tree, preserving the order of the 
nodes. Recall that a child node can represent either 
an executable file, which may root another access 
tree, or a data file. For the two trees to be con- 
sidered to describe the same file access pattern, we 
require that there be a one-to-one correspondence 
between all the executable files, but not all the data 
files. However, the same executable file may root a
different access subtree in the working tree than in
the pattern tree. We define C_d as the percentage of
data files that can be paired up, and C_e as the percentage
of pairs of executables that root the same access
subtree. Intuitively, C_d suggests how “compatible”
the data files are, and C_e suggests how “compatible” the
executable files are. We choose to use the average of
the two values as the compatibility of the two trees
in question.

Let us consider the finished working tree in Figure
2(b). There are three data files in the working tree
and two in the pattern tree. Out of these five files,
only the two appearances of E can be paired up,
making C_d 0.4. Three out of four executable pairs
(B, F and G) root identical access subtrees, so C_e is
0.75. The compatibility is thus (0.4 + 0.75) / 2 = 0.575.
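
The computation for a finished working tree can be sketched as follows. This is a hedged illustration rather than the authors' code, and it rests on several assumptions of ours: order-preserving pairing is done with a longest-common-subsequence count, data files and executables are compared as two separate sequences, a mismatch in the executables yields 0, an empty category counts as fully compatible, and "root the same access subtree" is taken to mean structural equality. The `Node` class and the sample trees are hypothetical; the trees are built only to reproduce the 0.4 and 0.75 figures above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    is_exec: bool = False
    children: list = field(default_factory=list)


def lcs_len(a, b):
    """Length of the longest common subsequence of two lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def compatibility(w_tree, p_tree):
    """Average of C_d and C_e over immediate children only."""
    w_data = [c.name for c in w_tree.children if not c.is_exec]
    p_data = [c.name for c in p_tree.children if not c.is_exec]
    w_exec = [c for c in w_tree.children if c.is_exec]
    p_exec = [c for c in p_tree.children if c.is_exec]
    # Executables must correspond one-to-one (assumption: else no match).
    if [c.name for c in w_exec] != [c.name for c in p_exec]:
        return 0.0
    total = len(w_data) + len(p_data)
    # Each pair accounts for two of the data-file appearances.
    c_d = 2 * lcs_len(w_data, p_data) / total if total else 1.0
    # Dataclass equality compares subtrees recursively.
    same = sum(1 for w, p in zip(w_exec, p_exec) if w == p)
    c_e = same / len(w_exec) if w_exec else 1.0
    return (c_d + c_e) / 2


# Hypothetical trees: one of five data-file appearances pairs up twice over
# (E with E), and three of four executables root identical subtrees.
p_tree = Node("A", True, [Node("B", True), Node("E"), Node("F", True),
                          Node("G", True), Node("X"),
                          Node("H", True, [Node("Y")])])
w_tree = Node("A", True, [Node("B", True), Node("E"), Node("F", True),
                          Node("G", True), Node("Y"), Node("Z"),
                          Node("H", True, [Node("Q")])])
score = compatibility(w_tree, p_tree)
```

Only immediate children are inspected at the top level, so the cost stays proportional to the number of child nodes apart from the subtree equality checks.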

The case of an unfinished working tree is similar 
except that one child in the pattern tree is first deter- 
mined to be the pivot node. The pivot corresponds 
to the most recently added child node in the work- 
ing tree. Only the child nodes in the pattern tree 
that appear before the pivot are involved in compat- 
ibility computation. If the pattern tree is selected, 
we prefetch those files that follow the pivot in the 
sequence of pattern tree preordering, since those be- 
fore the pivot probably have already been accessed. 
Given the unfinished working tree in Figure 2(c), 
node E is the pivot in the pattern tree. C_d and C_e
are both 0.5, giving rise to a 0.5 compatibility. If we 
decide to prefetch this pattern tree, only the files in 
subtrees T_F and T_G will be prefetched.
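
A short sketch shows how the list of files to prefetch might be derived from the pivot. It is an assumption-laden illustration, not the authors' code: `Node`, `preorder` and `files_to_prefetch` are hypothetical names, the pivot is identified by its index among the pattern tree's children, and the list is simply truncated at PREFETCH_CAPACITY without recording where to resume.

```python
from dataclasses import dataclass, field

PREFETCH_CAPACITY = 15  # value given in the text


@dataclass
class Node:
    name: str
    is_exec: bool = False
    children: list = field(default_factory=list)


def preorder(node):
    """Yield file names in preorder: a node first, then its subtrees."""
    yield node.name
    for child in node.children:
        yield from preorder(child)


def files_to_prefetch(p_tree, pivot_index, capacity=PREFETCH_CAPACITY):
    """Files that follow the pivot in the preorder of the pattern tree."""
    pivot = p_tree.children[pivot_index]
    files = []
    # Descendants of the pivot come right after it in preorder.
    for child in pivot.children:
        files.extend(preorder(child))
    # Then every later sibling of the pivot, with its whole subtree.
    for sibling in p_tree.children[pivot_index + 1:]:
        files.extend(preorder(sibling))
    return files[:capacity]


# Hypothetical pattern tree with pivot E at index 1.
pattern = Node("A", True, [
    Node("B", True, [Node("C"), Node("D")]),
    Node("E"),
    Node("F", True, [Node("D")]),
    Node("G", True, [Node("H")]),
])
```

For this tree, `files_to_prefetch(pattern, 1)` returns `['F', 'D', 'G', 'H']`, and a capacity of 3 cuts the list to `['F', 'D', 'G']`.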

Since the compatibility function is invoked often, 
it is important that it not be expensive. That is why 
we examine only immediate child nodes. The time 
complexity of the computation is proportional to the 
number of child nodes. 


3 Simulation 


Our initial assessments of the mechanism were ob- 
tained by trace-driven simulation. 


3.1 Method 


We gathered several file traces [24] on SunOS 4.0.3c. 
This version of SunOS offers a “C2 secure comput- 
ing facility” that includes the ability to produce a 
system call audit trail. Using this feature, we gath- 
ered three traces of a volunteer user performing his 
normal work activity over a period of two weeks. 
The first trace contains 25,441 invocations of pre- 
viously enumerated system calls captured over 72 
hours. The second trace contains 23,858 invocations 
captured over 52 hours, while the numbers for the 
third trace are 85,962 and 86, respectively. During 
these hours activity varied widely and included com- 
pilations, document production, data analysis and 
display, large file searches, news reading, printing,
and other operations. 

Our simulated cache manager used an LRU re- 
placement policy. It stepped through the traces, 
maintaining the cache in accordance with the user 
file accesses. Since our trace data lacks file sizes, we 
defined cache size by number of files. We varied the
cache size and, for each cache size, compared the
results with prefetching against those without prefetching.
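
The no-prefetching baseline of such a simulation is easy to sketch. The code below is an illustration under our own assumptions, not the simulator used in the study: `lru_miss_rate` is a hypothetical name, the trace is reduced to a list of file names, and the cache holds whole files counted by number, as in the text.

```python
from collections import OrderedDict


def lru_miss_rate(accesses, cache_size):
    """Fraction of accesses that miss in an LRU cache of `cache_size` files."""
    cache = OrderedDict()              # least recently used entry comes first
    misses = 0
    for name in accesses:
        if name in cache:
            cache.move_to_end(name)    # refresh recency on a hit
        else:
            misses += 1
            cache[name] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used file
    return misses / len(accesses) if accesses else 0.0


trace = ["a", "b", "a", "c", "b", "d", "a"]
small = lru_miss_rate(trace, cache_size=2)
large = lru_miss_rate(trace, cache_size=3)
```

On this toy trace the two-file cache misses 6 of 7 accesses and the three-file cache misses 5 of 7. A prefetching variant would insert predicted files into the same cache before they are demanded.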


3.2 Results


Our first metric was the cache miss rate. The results
for the three traces appear in Figure 3. Prefetching
delivers a substantially better miss rate at all
cache sizes. The results are worse for the second
trace because it includes a recursive directory traversal
(i.e., the UNIX find program) over a hierarchy
that includes thousands of files that are never
accessed again.

In addition, we monitored the cache behavior at 
a finer grain: how well our mechanism worked over 
each run of 500 accesses. At the end of each run, we 
checked to see whether prefetching had beaten no- 
prefetching by measuring the number of misses dur- 
ing that run. As shown in Table 1, prefetching won 
this comparison easily and consistently. This shows 
that our mechanism’s superiority is steady and sta- 
ble, and not simply a result of a few exceptionally 
fruitful prefetch sequences. The few losses due to 
bad prefetching guesses are more than offset by the 
many wins. 

Both Figure 3 and Table 1 indicate that the in- 
creased intelligence of our mechanism is more effec- 
tive in smaller caches. This was expected. Consider 
that, in the extreme case of an infinitely large cache, 
any file appearing in any access tree is still in the 
cache. There is no room for improvement from any 
prefetching algorithm that is based solely on infor- 
mation about the past. 

We measured the accuracy of our prefetch decisions,
defined as the percentage of file access predictions
that were actually used. We also calculated
one overhead of the mechanism, i.e., the percentage
of the file fetches that were initiated due to bad
prefetch decisions. This reflects the network bandwidth
and server capacity that were wasted. Table 2
shows these results. (We shall address another overhead,
the CPU cycles expended by the prefetcher, in
Section 4.) Under the LRU policy, bigger caches
give better accuracy results because prefetched entries
have a better chance to survive cache entry
replacements. Similarly, bigger caches also give better
overhead results because bad predictions have a better
chance of already being in the cache; in
such a case, no fetch is performed.


Figure 3: Cache Miss Rate. These charts compare
the cumulative cache miss rate with prefetching
(PF) against that without prefetching (No PF). The
comparison is performed under three different cache
sizes (S), measured by number of files.

The compatibility function plays a critical role
in our mechanism. The assumption underlying this
function is that if the initial portions of two access
trees bear a high compatibility, so will the two trees
in their entirety. In order to test this assumption, we
examined the number of loads, i.e., the total number
of times we prefetched pattern trees. We further
classified these loads as either proper or improper. A
load became improper when a different pattern tree
was selected by the prefetcher to take the place of
the current one, or when the finished working tree
did not match the pattern tree loaded. Otherwise,
the load was proper. Our results in Figure 4 suggest
that the compatibility function is effective.


Table 1: A Finer-grained Comparison. At the
end of each run of 500 file accesses, we determined
whether prefetching had beaten no-prefetching by
measuring the number of cache misses during that
run.


Table 2: Prefetching Accuracy and Overhead.
The accuracy is the percentage of the predictions
that were actually used. The overhead is the percentage
of the file fetches due to bad guesses.

Cache Size | Accuracy (%) | Overhead (%)

Finally, we note that the algorithmic parameters
should be stable: small variations in the settings
must not produce large performance degradations.
To illustrate the stability of the algorithmic parameters,
we measured the mechanism over a wide range of
parameter values. In Figure 5, we show the results
for Trace 3 and a cache size of 50. Three measurements
are presented: cache miss rate, prediction
accuracy and prefetch overhead. The results for the
other two traces and cache sizes are similar.


Figure 4: Effectiveness of the compatibility
Function. This figure illustrates, for each trace,
the split of pattern tree loads into proper ones and
improper ones.

The simulation demonstrates that our algorithm 
can accurately predict future file accesses based on 
past file usage. The major limitation of the simula- 
tion study is that it does not account for the relative 
timing of events due to the absence of such informa- 
tion in the traces used. As a result, prefetched files 
are assumed to appear in the cache instantaneously. 
It remains to be determined whether a real system 
will have the resources to exploit the information on 
future file accesses. The limitations of the simula- 
tion motivated us to conduct a full implementation 
and further evaluations. 


4 Implementation 


We have implemented our mechanism in UX42 [8], 
a BSD UNIX server running on Mach 3.0 [1]. UX42 
resides in user space and is organized as a collection 
of C threads [5]. Most threads handle BSD system 
calls. Among the others are NFS [23] async dae- 
mons, which handle asynchronous NFS block I/O 
requests. Since we expect that network file ac- 
cesses would be the performance bottleneck in a 
client-server architecture, we prefetch only NFS files 
opened for read. 


Figure 5: Mechanism Stability. This figure
presents measurements of the prefetching
mechanism over a wide range of parameter values.
MATCH_THRESHOLD was varied from 0 to 1, and
PREFETCH_CAPACITY from 0 (no prefetching) to 50.
The measurements were taken for Trace 3 and a
cache size of 50.


4.1 Structure


Figure 6 shows the basic structure of the imple- 
mentation, where each box stands for a C thread 
and the shaded area constitutes the prefetcher. The 
prefetcher consists of two pieces of code. The first 
is the system-independent prefetch engine, which
handles prefetch analysis, working tree construction 
and pattern tree maintenance. The same code was 
used in the simulation. In the implementation, the 
BSD service threads were modified to provide the 
prefetch engine information on each fork, execve, 
open, chdir and exit system call. The prefetch en- 
gine processes this information, makes prefetch deci- 
sions and enters the files to be prefetched in a queue. 
The second piece of prefetcher code is an added 
thread called the prefetch daemon. The prefetch 
daemon consumes the file queue and produces block 
read requests in another queue. These requests are 
then satisfied by the async daemons. Unlike a user- 
initiated read operation, a prefetch ends when the 
block is placed in the system buffer cache. No copy 
to user space is necessary. 










Figure 6: Structure of Implementation. A 
prefetch daemon is added to the collection of C 
threads in UX42. Each BSD service thread is ex- 
tended to call the prefetch engine when a relevant 
system call is serviced. 





The prefetch daemon takes advantage of the NFS
readahead logic by prefetching only the first block of
a file. When a requested block number is one more
than the number of the block last read (r_lastr),
NFS readahead logic speculates that the file is being
accessed sequentially, and initiates an asynchronous
read on the next block together with the requested
block. At least two requested blocks are needed to
establish the sequential access pattern before NFS
starts readahead. Accordingly, when it removes an
entry from the file queue, the prefetch daemon generates
a read request for only the first file block (block
0). It also sets r_lastr to -1. Later, when block 0 is
actually accessed, the request will hit in the cache,
and, since r_lastr is -1, a readahead on block 1 will
be issued. This process continues as the file is read
block by block, which is the norm. Since the prefetch
daemon issues only one block read for each file to be
prefetched, the cost of prefetching is minimal.
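
The interaction between the prefetch daemon and the readahead logic can be modelled with a toy class. This is a simplified sketch, not the UX42 or NFS code: only the field name `r_lastr` comes from the text, while the `NfsFile` class, its methods, the use of `None` for "no block read yet", and the synchronous fetch log are assumptions made for illustration.

```python
class NfsFile:
    """Toy model of one file's block cache and readahead state."""

    def __init__(self):
        self.r_lastr = None   # block last read; None means nothing read yet
        self.cache = set()    # block numbers currently cached
        self.fetched = []     # log of (block, reason) fetches from the server

    def _fetch(self, block, reason):
        if block not in self.cache:
            self.cache.add(block)
            self.fetched.append((block, reason))

    def prefetch_first_block(self):
        """What the prefetch daemon does for a predicted file."""
        self._fetch(0, "prefetch")
        self.r_lastr = -1     # makes the later read of block 0 look sequential

    def read(self, block):
        """Demand read; returns True if the block was already cached."""
        hit = block in self.cache
        self._fetch(block, "demand")
        # Readahead fires when this block directly follows the last one read.
        if self.r_lastr is not None and block == self.r_lastr + 1:
            self._fetch(block + 1, "readahead")
        self.r_lastr = block
        return hit
```

In this model a prefetched file is read entirely from the cache when accessed sequentially, whereas a file that was not prefetched takes two demand misses (blocks 0 and 1) before readahead begins.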

We have ensured that prefetch I/O yields to regu- 
lar NFS asynchronous I/O. All the NFS async dae- 
mons can be used towards regular I/O, but only up 
to a certain number of them towards prefetch I/O. 
Prefetch I/O will be started only if this limit has not 
been reached and there is no pending regular I/O. 
Thus, regular asynchronous I/O’s can be serviced
promptly and when there are too many of these, 
the prefetcher will refrain from issuing any prefetch 
I/O’s. This ensures that prefetching halts when the 
system capacity limit is reached, and does not add 
extra load to an already overloaded system.

The implementation consists of approximately 
3380 lines of C code. Of these, 90 lines have been 
added to the existing UX42 source files, 2280 lines
are in separate “.c” files and 1010 lines in “.h” files. 


4.2 Controlled Experiments 


We started the evaluation of the implementation 
with two simple, controlled experiments. The fol- 
lowing questions motivated our experiments: 


• How large are the potential benefits of prefetching?


• Under antagonistic circumstances, what is the
bearing of prefetching on performance?


• What is the CPU overhead due to prefetching?


The first experiment is a shell script that consists 
of tens of filter programs, each of which reads one 
parametric input file, performs some transformation, 
and writes one output file. Many well known UNIX 
programs are filters; included in our script are: awk, 
compress, sed, sort, strings, uniq, uuencode, and 
members of the grep family. All input files are the 
same size and reside remotely. 

The second experiment is composed of several pro- 
gram builds. Although builds may seem ideally 
suited to prefetching, they are antagonistic, for two 
reasons. First, on our hardware platform a build is 
CPU intensive, leaving the prefetch mechanism rel- 
atively little excess CPU cycles with which to make 
prefetch decisions. Second, compilations often access
header files rapid-fire, meaning that the time between
a prefetch decision and the actual need for the
data may not be sufficient to complete the prefetch
I/O.

(a) Wired link (10 Mb/sec)

File Size | Buffer Cache Miss Rate | Name Cache Miss Rate | Latency                     | Speedup
(blocks)  | No PF    | PF          | No PF    | PF        | No PF        | PF           |
4         | 30.30%   | 0.00%       | 68.47%   | 4.64%     | 34.24 (1.18) | --           | --
8         | --       | --          | --       | --        | 46.39 (0.27) | --           | 1.24

(b) Wireless link (2 Mb/sec)

File Size | Buffer Cache Miss Rate | Name Cache Miss Rate | Latency                     | Speedup
(blocks)  | No PF    | PF          | No PF    | PF        | No PF        | PF           |
1         | 41.67%   | --          | 68.47%   | --        | 23.94 (0.54) | 14.33 (0.89) | 1.67
2         | 52.63%   | 0.88%       | 68.47%   | 6.72%     | 30.07 (1.24) | 20.34 (0.66) | --
4         | 30.30%   | 1.07%       | 68.47%   | 6.63%     | 38.76 (1.10) | 28.38 (1.24) | --
8         | 16.39%   | 0.07%       | 68.47%   | 4.92%     | 54.04 (1.28) | 42.94 (1.10) | --

(Cells shown as -- and the wired rows for file sizes 1 and 2 are not recoverable from this copy.)

Table 3: Filters Experiment. This table summarizes the performance results of the filters experiment.
Part (a) presents the results with the 10 Mb/sec wired link; part (b) with the 2 Mb/sec wireless link.

The experiments were run standalone. Both the 
client and the server are 486 processors with 16MB 
of memory. The client dedicates 1.57MB to its UNIX
buffer cache. Since we are interested in how our
mechanism behaves at different network bandwidths,
we ran the tests using both
hardwired and wireless links directly connecting the 
client and server. The hardwired connection is an 
Ethernet (10 Mb/sec), while the wireless link is an 
NCR “WaveLAN” [26] radio link with a maximum 
2 Mb/sec data rate. For each combination of exper- 
iment and network link, we ran the tests both with 
and without prefetching. Each number reported be- 
low is the mean of three trials. 


Table 3(a) shows the results when the filters experiment
is run with a wired link. We varied the work
load by using different input file sizes, as given in the 
“File Size” column. Size is measured in multiples of 
the server’s preferred NFS block size, which is 4KB. 
For each work load, we list the UNIX buffer cache 
miss rate as well as the directory name lookup cache 
miss rate. Since our mechanism deals exclusively 
with network files, the cache miss rates are those 
of remote entries. The reduction of the miss rates 
due to prefetching is substantial. We also present 
the measurements of application latency, or total 
elapsed time, which we couldn’t do in the simulation. 
From a user’s standpoint, latency is the most important
performance metric. The time measurements
are in seconds, with standard deviations included in 
parentheses. The “Speedup” column is the ratio of 
no-prefetching latency to prefetching latency. The 
speedups are significant, particularly when the in- 
put files are relatively small, since the delay caused 
by a sequential file access mainly lies in the first one 
or two block reads. The remaining blocks, if any, 
will be brought into cache by NFS readahead logic 
before they are needed. 
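
As a worked example of the speedup metric, the short Python sketch below recomputes the ratio from the wireless-link latencies as they appear in Table 3(b). The dictionary and function names are illustrative only; the numbers are copied from the table and nothing else is assumed.

```python
# (no-prefetch latency, prefetch latency) in seconds, keyed by file size in blocks,
# copied from the wireless-link rows of Table 3(b).
WIRELESS_LATENCY = {
    1: (23.94, 14.33),
    2: (30.07, 20.34),
    4: (38.76, 28.38),
    8: (54.04, 42.94),
}


def speedup(no_pf, pf):
    """Ratio of no-prefetching latency to prefetching latency."""
    return no_pf / pf


def seconds_saved(no_pf, pf):
    """Absolute reduction in elapsed time."""
    return no_pf - pf


for size, (no_pf, pf) in sorted(WIRELESS_LATENCY.items()):
    print(f"{size} blocks: speedup {speedup(no_pf, pf):.2f}, "
          f"saved {seconds_saved(no_pf, pf):.2f} s")
```

The ratio shrinks as files grow while the absolute saving stays within roughly 10 to 11 seconds, which is consistent with the observation that the avoidable delay sits in the first one or two block reads of each file.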

Table 3(b) shows the results when the same ex- 
periment is run over a wireless link. The application 
latency is larger because of the slower link, but the 
speedups are comparable to those in a wired setup. 
The cache miss rates with prefetching are slightly 
higher over a wireless link than over a wired link 
since a small number of prefetch operations cannot
be completed in time for demanded use. Neverthe- 
less, the bandwidth of the wireless link is adequate 
to perform NFS readahead on a timely basis, so it 
still suffices for the prefetcher to read only the first 
file block for each prediction it makes. 

Figure 7 graphically illustrates the application la- 
tency of the filters experiment. 

Figure 7: Filters Experiment: Comparison of Application Latency. These graphs illustrate the
latency data in Table 3. Part (a) compares latency in the wired setting; part (b) in the wireless setting.
Prefetching results in significant speedup in this experiment.

Our second experiment consists of builds of several
UNIX utility programs. Table 4 contains the results
of these tests. The application latency is also illustrated
in Figure 8. When a large number of header
files are opened in quick succession, as is common during
compilation, the prefetcher often does not have
enough time or available CPU cycles to make a fruitful
prefetch even though it can speculate accurately
which files will soon be referenced. The relatively
small name cache miss rate improvements confirm
this. However, prefetching still manages to reduce
the buffer cache miss rate by 65% to 85%. There is
no significant latency enhancement, suggesting that
the total file read time is not dominant in the application
latency. We are pleased to see that no negative
effects are observed in an adverse case such as
this.

We have used simulation traces to demonstrate 
the prefetch overhead in terms of wasted network 
bandwidth and server capacity. The implementation 
enables us to collect data on another type of cost, ex- 
tra CPU consumption. Table 5 presents the results 
for the two experiments. The CPU time includes the 
system time and user time. Again the time measure- 
ments are in seconds and the standard deviations are 
given in parentheses. The CPU time overhead is neg- 
ligible in all cases. We also show the ratio of CPU 
time to application latency (CPU/latency). It comes 
as no surprise that prefetching increases this ratio, 
substantially in the first experiment. A prefetcher 
reduces the total elapsed time by increasing the par- 
allelism between CPU processing and I/O. 


4.3 Discussion 


There are three necessary conditions for prefetching
to be useful. First, there must be spare capacity
in the whole “data pipe” that extends from the
server’s disk to the client’s cache. This pipe consists of
the client’s network I/O interface, the network, the
server’s CPU, and the server’s disk and network I/O 
paths. Second, there must be system resources on 
the client side for the prefetcher’s use. Of special im- 
portance are CPU cycles. The prefetcher cannot ful- 
fill its duty if it doesn’t acquire CPU cycles promptly, 
even though the amount of CPU time needed is low. 
Third, the workload must allow some interval be- 
tween file accesses so that the prefetched I/O can be 
started and completed ahead of the demand. 


Prefetching proves feasible on both wired and 
wireless network connections. Although our exper- 
iment results suggest that the speedups for these 
two situations are comparable, the speedup may be 
even better appreciated by the client at the end of 
a wireless link. With the slower link, a job lasts 
longer and more real time can be saved via prefetch- 
ing. As the ratio of CPU speed to network speed 
increases, prefetching should provide more benefit. 
However, there is some subtlety with our implementation
scheme. When the ratio rises to a certain
point, we will no longer be able to depend on
the NFS readahead function to bring in subsequent 
file blocks quickly enough. This difficulty can be 
tackled by a simple generalization of our initial approach.
The prefetcher can always prefetch the first
N file blocks (N growing with the CPU/network ra- 
tio) and the NFS readahead code can be modified 
to read the next N-th file block, instead of the immediately
next block. Currently, N is simply set to 1.

Columns: Utility || Buffer Cache Miss Rate (No PF | PF) | Name Cache Miss Rate (No PF | PF) || Latency (No PF | PF) | Speedup

(a) Rows as recovered (utility names and some cells are missing)

9.03% | 7.84% || 30.93 (0.44) | 29.81 (0.49)
10.57% | 3.17% | 10.35% | 9.83% || 64.11 (1.05) | 63.63 (0.68)
23.42% | 3.30%   6.90% || 25.48 (0.78) | 24.70 (2.51)

(b) Rows as recovered (some cells are missing)

hash info || 32.09% | 5.61%   6.99% || 26.22 (0.66) | 24.79 (0.47) | 1.06
snames    || 17.20% | 2.96% | 9.09% | 7.91% || 38.17 (1.17) | 33.89 (1.64)
machid    || 9.15%   10.39%   77.31 (3.25) | 74.42 (2.13)
machipc   || 3.27%   29.10 (0.36) | 27.17 (0.30)

Table 4: Builds Experiment. This table summarizes the performance results of the builds experiment. 
Part (a) presents the results with the 10 Mb/sec wired link; part (b) with the 2 Mb/sec wireless link. 


5 Related Work 


Prefetching is an old idea. It has been studied 
extensively in various areas, including prepaging, 
prefetching of files and prefetching of database ob- 
jects. Prepaging has not had a major impact in com- 
puter architecture because of the tight time and com- 
plexity constraints on paging hardware and software. 
However, prefetching of files and database objects is 
a more promising endeavor for two reasons. First, 
since file and database accesses are less frequent than 
page accesses, the speed with which the decision to 
prefetch must be made is not so much of the essence. 
Secondly, the resource most needed to arrange intel- 
ligent prefetching, namely client CPU cycles, is the 
resource most in excess in distributed systems now 
and in the likely future. 

Some researchers have looked into prefetching 
blocks within files. Sequential readahead [7, 17] 
is the most primitive and yet successful approach. 
The work of Kotz and Ellis (see, for example, [14]) 
focuses on the uses of prefetching to increase I/O 
bandwidth on MIMD shared-memory architectures. 
Their prefetching methods are geared to the patterns 
of many scientific and database applications running 
on multiprocessors. In contrast to these methods, 
our work looks for access patterns across files. 

Some file prefetching methods require that each 
application inform the operating system of its future 
demands. This includes the TIP project by Patterson
et al. [21, 22] and the work of Cao, Felten, Karlin
and Li [3, 4]. These researchers consider the interaction
between prefetching and caching more carefully.
TIP uses application-disclosed access patterns to dy- 
namically allocate file buffers between the competing 
demands of prefetching and caching, based on a cost- 
benefit model. Cao et al. allow applications to pass 
down both prefetching hints and caching hints. They 
then employ an integrated algorithm for prefetch- 
ing and caching, which is shown to be theoretically 
near-optimal. These “informed” approaches possess 
an advantage over ours in that prefetching is driven 
not by deductions made after snooping, but rather 
by certain knowledge provided in advance by higher 
levels. There is no danger that disastrously incorrect 
speculative prefetching might trash the cache. On 
the other hand, these approaches require re-coding 
applications. Also, the prefetching mechanism must 
act within the interval between when the higher level 
learns of the need to do I/O and when it actually ini- 
tiates I/O; this interval may not always be sufficient 
to perform the prefetch I/O. 


Figure 8: Builds Experiment: Comparison of Application Latency. These graphs illustrate the
latency data in Table 4. Part (a) compares latency in the wired setting; part (b) in the wireless setting.
There are no observable negative effects in this antagonistic experiment.

Like ours, a number of other prefetching methods
are completely transparent to clients. They use past
accesses to predict future accesses. While we seek
to build semantic structures, i.e., access trees, that
are endowed with application-level meaning, most of
the other approaches use a probabilistic method to
model the user behavior. In its most general form,
a probabilistic method regards file references as a
string of events, and uses information about the last
C events to estimate the frequency distributions of
the next L events (C and L are called context order
and lookahead size, respectively). For simplicity,
however, existing methods have either single context
order or single lookahead. Unlike our tree-based approach,
there is no attempt to build in an “understanding”
of why files are likely to be referenced together.

Some recent examples of work based on probabilistic
methods are a paper by Curewitz et al. [6] on
prefetching objects in an object-oriented database,
and work by Griffioen and Appleton [9, 10] and by
Kroeger and Long [15] on prefetching whole files.
Both [6] and [15] adapt context modeling techniques 
used in data compression to predict the next access. 
Their work is inspired by the idea that a good data 
compressor should be able to predict future data 
well. Griffioen and Appleton’s work, in comparison, 
employs a “probability graph” that for each file accumulates
frequency counts of all files that are accessed
within the lookahead window. 

In the initial stages of our work, we considered 
probabilistic modeling. We implemented (in the 
simulator only) two extremely simple probabilistic 
methods, which we called “stupid pairs” and “smart 
pairs:” 


• Stupid pairs: When file F is accessed, prefetch
the file that was accessed immediately following
F the last time F was accessed.


Sample series of accesses: F, G (remember F-G),
F (prefetch G, remember G-F), H (remember F-H),
F (prefetch H, remember H-F), H (cache hit).


• Smart pairs: Keep track of all files that are
accessed immediately after F, and when F is
accessed the next time, choose one according to
the frequency distribution.


Sample series of accesses (only F’s pairs are
shown): F, G (F-G1), F (prefetch G), H (F-[G1, H1]),
F (prefetch G or H with 1:1 weighting),
H (F-[G1, H2]), F (prefetch G or H with
1:2 weighting).
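
The two schemes can be sketched in a few lines of Python. This is an illustrative reconstruction from the descriptions above, not the simulator code: the class names, the access method (which records one reference and returns the file to prefetch, or None), and the use of a seeded random generator for the weighted choice are all assumptions.

```python
import random
from collections import Counter, defaultdict


class StupidPairs:
    """Remember only the most recent successor of each file."""

    def __init__(self):
        self.successor = {}   # file -> file accessed right after it last time
        self.prev = None

    def access(self, name):
        if self.prev is not None:
            self.successor[self.prev] = name   # "remember prev-name"
        self.prev = name
        return self.successor.get(name)        # file to prefetch, if any


class SmartPairs:
    """Keep frequency counts of every successor of each file."""

    def __init__(self, rng=None):
        self.counts = defaultdict(Counter)     # file -> Counter of successors
        self.prev = None
        self.rng = rng or random.Random(0)     # seeded for repeatability (assumption)

    def weights(self, name):
        return dict(self.counts.get(name, {}))

    def access(self, name):
        if self.prev is not None:
            self.counts[self.prev][name] += 1
        self.prev = name
        w = self.counts.get(name)
        if not w:
            return None
        files = list(w)
        # Choose a successor according to the observed frequency distribution.
        return self.rng.choices(files, weights=[w[f] for f in files], k=1)[0]
```

Feeding the sample series F, G, F, H, F, H to StupidPairs yields prefetches of G and then H at the second and third accesses to F, while SmartPairs ends with weights of G:1 and H:2 for F, matching the traces above.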


The stupid pair scheme worked better than the 
smart pair approach when applied to our traces. 
This unexpected result is explained by considering 
the locality of many accesses: often, a user works 
on one file or one group of files for some time, then 
moves on to similar operations with different files. 
Stupid pairs are well-equipped to handle this usage 
pattern, since they invariably prefetch the most re- 
cently used successor file. The seemingly superior 
intelligence of smart pairs actually becomes a liabil- 
ity when locality is strong; files no longer in active 
use may be given undue weight if they were heavily 
accessed in the past. 

The phenomenon, that less information is bet- 
ter provided that it is more recent, identifies one 
common weakness of existing probabilistic methods: 
they don’t address the problem of aged information. 
Instead they make predictions based on a total his- 
tory of accesses. In addition, single lookahead meth- 
ods like [6] and [15] are less likely to be widely appli- 
cable: as I/O latencies grow in terms of CPU cycles, 
prefetches must begin ever further in advance if they 
are to complete in time. On the other hand, single 
context order methods like that of Griffioen and Ap- 
pleton fail to make full use of historical file access 
information, and thus are unable to confidently infer
accesses far into the future.

(a) Filters experiment

File Size | Wired Link (10 Mb/sec)                              || Wireless Link (2 Mb/sec)
(blocks)  | CPU Time No PF | CPU Time PF  | CPU/latency No PF | PF || CPU Time No PF | CPU Time PF  | CPU/latency No PF | PF
4         | 7.88 (0.49)    | 8.78 (0.79)  | 23.00%  | 35.58% || 7.39 (0.22)    | 8.60 (0.35)  | 19.07%  | 30.29%
8         | 13.24 (1.03)   | 14.45 (0.53) | 28.54%  | 38.58% || 13.14 (0.08)   | 14.58 (0.08) | 24.32%  | 33.95%

Rows for file sizes 1 and 2 survive only in part:
3.82 (0.28)   36.87% || 3.69 (0.07)
5.39 (0.71)   33.29% || 4.77 (0.09) | 5.83 (0.14)

(b) Builds experiment (fragments as recovered, in printed order)

(0.37) | 24.08%            5.17 (0.1)
(0.80) | 33.48% | 37.73% || 10.69 (0.10)
(1.28) | 47.24%            30.96 (1.82)
(0.45) | 26.45% | 30.90% || 6.91 (0.27)   28.72%
PF column: 5.84, 11.25, 32.08, 7.63
30.29 (0.92)
machipc: 6.74 (0.22)

Table 5: CPU Time Consumption. This table presents the CPU time with and without prefetching, over
two different network links. Also given is the ratio of CPU time to application latency. Part (a) gives the
results for the filters experiment; part (b) for the builds experiment.


Similar to our work, Palmer and Zdonik’s work 
on Fido [20] also explicitly recognizes and maintains 
access patterns. Several important aspects make 
their work different. First, their work was conducted 
in the context of object-oriented database systems, 
whereas our context is file systems. Second, they 
represent access patterns with strings of object iden- 
tifiers, with no semantics involved. We represent 
patterns with access trees. Third, they employ specialized
pattern memory, while we store pattern trees
in virtual memory. Finally and most notably, Fido 
requires a separate training phase after each user ses- 
sion, while our mechanism is more on-line: there is 
no off-line computation or periodic analysis needed. 


An issue that is related to prefetching is hoard- 
ing [12, 11, 16, 25]. Both prefetching and hoard- 
ing involve anticipatory file fetches: bringing files 
from remote servers into a local cache before they are 
needed. These are not exactly the same techniques, 
however. Hoarding is a scheme designed to increase 
the likelihood that a mobile client will be able to 
continue working during periods of total disconnec- 
tion from file servers. Since hoarding is a relatively 
infrequent operation performed only at a client’s re- 
quest prior to disconnection, timing is not critical. 
On the other hand, prefetching is mainly concerned 
with improving performance and timing is important.
With prefetching, the file server is assumed to
be still accessible, although the network connectiv- 
ity may be weak. A cache miss is much more catas- 
trophic in disconnected operations, hence hoarding 
is typically willing to overfetch substantially in order 
to enhance the availability of files. Despite the dif- 
ferences, our idea of uncovering and exploiting the 
semantic structure underlying file references also ap- 
plies to the hoarding problem, as shown in [25]. 


6 Conclusion 


We have presented a technique for transparent on- 
line file prefetching. The technique analytically 
models interesting system calls and builds seman- 
tic structures that capture the intrinsic correlations 
between file references. It makes accurate predic- 
tions of future file accesses, imposes little CPU over- 
head, defers to demand I/O, and delivers substan- 
tially lower client cache miss rates and elapsed time 
for I/O-intensive applications. 

One central trait of the algorithm is that it spends 
client CPU cycles in return for more effective use 
of client cache space and fewer on-demand network 
operations. Another distinguishing aspect is that 
the algorithm’s lookahead ability is potentially much 
greater than that of previous work. Both of these 
traits help to couple application I/O performance 
more closely to CPU speed than to I/O device speed,
thereby addressing a fundamental and longstanding
problem in operating systems [18]. 

Our initial performance evaluation has been en- 
couraging. We intend to extend the evaluation along 
several directions. First, we would like to run ex- 
periments that model user behavior more realisti- 
cally. We are in the process of synthesizing work- 
loads based on actual file system traces. Second, 
we plan to conduct experiments over a much wider 
spectrum of network and client capacities. We hope 
to reach a good understanding of when prefetching
is feasible. Third, we would like to examine the ap- 
plicability of our prefetching algorithm to different 
kinds of file systems: local file systems, remote file 
systems with caching of file blocks in main memory 
only, and remote file systems with whole file caching 
on client disks.


Abstract


When a machine is connected to the Internet via a slow 
network, such as a 28.8 Kbps modem, the cumula- 
tive latency to communicate over the Internet to World 
Wide Web servers and then transfer documents over the 
slow network can be significant. We have built a sys- 
tem that optimistically transfers data that may be out of 
date, then sends either a subsequent confirmation that 
the data is current or a delta to change the older version 
to the current one. In addition, if both sides of the slow 
link already store the same older version, just the delta 
need be transferred to update it. 

Our mechanism is optimistic because it assumes that 
much of the time there will be sufficient idle time to 
transfer most or all of the older version before the 
newer version is available, and because it assumes that 
the changes between the two versions will be small relative
to the actual document. Timings of retrievals of
random URLs in the Internet support the former as- 
sumption, while experiments using a version reposi- 
tory of Web documents bear out the latter one. Per- 
formance measurements of the optimistic delta sys- 
tem demonstrate that deltas significantly reduce la- 
tency when both sides cache the old version, and op- 
timistic deltas can reduce latency, to a lesser degree, 
when content-provider service times are in the range of 
seconds or longer. 


1 Introduction 
The Internet, and particularly the World Wide Web
(W³), consists of an ever-increasing number of
servers, networks, and personal machines with dramatically
varying qualities of service. Individuals
often access the W³ via modems with bandwidth
of 14.4-28.8 Kbps, while their provider might have 
a T1 link or better to the rest of the Internet. But
then the actual access may be to another site with 
low bandwidth, high server load, or both. Thus the 
latency to respond to the user’s request for a page 
is unpredictable: often fairly low, but sometimes 
extremely high. The unpredictability and generally 
slow response may be exacerbated in environments 
with even poorer quality of service, such as cellular 
telephony or wide-area wireless networks. 


A number of techniques have been implemented or 
proposed to deal with HTTP latency. A browser can 
direct its request to a proxy-caching server (henceforth
referred to as a server proxy) on the other end of the 
low-speed connection, and then the latency for retriev- 
ing pages elsewhere in the Internet can be eliminated 
when someone else has retrieved those pages in the 
recent past. (Recency is a function of the size of the 
cache, any expiration dates in the pages, and any con- 
straints passed from the browser to the cache [5, 12]. 
Also, some pages are flagged as uncacheable, and the 
proxy-caching server is obliged to pass those requests 
through to the content provider. Caching is discussed 
further in Section 2.) America Online (AOL) uses a 
proprietary protocol between the browser on a user’s 
machine and an enormous cache of W³ pages within
the AOL server cluster. Prefetching pages during peri- 
ods when the modem would otherwise be idle can re- 
duce or eliminate the latency of following a link, if the 
prefetch is accurate and the user thinks between clicks 
long enough for prefetching to complete [18, 22]. Padmanabhan
and Mogul’s study of persistent HTTP connections
[17] indicates that a persistent connection between
a client and proxy, or between one of them and a
content provider, will eliminate TCP connection setup 
and slow-start overhead. 

We look at the problem of latency from another per- 
spective: using computation to improve end-to-end 
network latency. The idea of trading off computation 
for I/O bandwidth has appeared numerous times in past 
systems. Examples include application-specific deltas 
and compression, such as Low-bandwidth X [15]; compressed
network or disk I/O [3, 6, 7]; replicated file
systems [2]; shared memory [16]; and checkpoint- 
ing [9, 19]. It seems that the same tradeoffs apply in 
the domain of the W³.

We address the issue of latency from the perspec- 
tive of sending the differences between versions of a 
page, or deltas, in order to avoid sending entire pages. 
Briefly, deltas are used in two ways. First, if both ends 
of the slow link store the same version of a page, and 
the server proxy obtains a new copy of the page, it can 
send a delta to the user’s machine. This hopefully will 
reduce the transfer time on the slow link. Second, if 
the user’s machine does not store the older version, the 
server proxy can send a potentially obsolete version of 
the page immediately and request the new version in 
parallel. It follows this with a delta against the obsolete
version if necessary. Thus, the idle time on the slow 
link when the server proxy is waiting for a response 
from the end server is not wasted. 

The size of the delta may turn out to be large, in
which case the server proxy may have to abort send- 
ing the stale version (if one is being sent) and just send 
the current page received. In this case, work for send- 
ing the stale data and calculating the delta is wasted. 
Worse, as the server proxy does not pipeline the data 
from the content provider to the client but instead waits 
till the whole page has been received (since it needs 
to compute a delta), the end latency as perceived by 
the client may be somewhat larger. Our approach opti- 
mistically assumes that this case is uncommon; hence 
we refer to the case where stale data is transferred as 
an optimistic delta. The experiments described in this 
paper support this assumption. In contrast, we refer to 
the case where both sides share a cached version as a 
simple delta. 
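
To make the delta idea concrete, the following Python sketch computes a delta between two versions of a page, applies it, and decides whether sending the delta is worthwhile. It is an illustration only and not the system described in this paper: the copy/data encoding, the use of difflib from the Python standard library, the 8-byte cost assumed for each copy instruction, and the rule "send the delta only if it is smaller than the new page" are all assumptions.

```python
from difflib import SequenceMatcher

COPY_COST = 8  # assumed bytes needed to encode one ("copy", start, end) instruction


def make_delta(old, new):
    """Describe `new` as copies from `old` plus literal data."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))      # reuse old[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))  # literal text from the new version
        # "delete": nothing needs to be sent
    return ops


def apply_delta(old, ops):
    """Rebuild the new version from the old version and the delta."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


def delta_size(ops):
    return sum(len(op[1]) if op[0] == "data" else COPY_COST for op in ops)


def choose_transfer(old, new):
    """Return "delta" if the delta is smaller than the page, else "full"."""
    return "delta" if delta_size(make_delta(old, new)) < len(new) else "full"
```

For example, apply_delta(old, make_delta(old, new)) reproduces new for any pair of strings, and choose_transfer falls back to "full" when the two versions share nothing, which corresponds to the case where the server proxy abandons the delta and sends the current page.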

As we were going to press, we found that the “simple
deltas” case is similar to a system from IBM called
WebExpress [13]. WebExpress is geared toward a low- 
bandwidth wireless environment, where bandwidth is 
precious, so it has similar goals but makes different 
trade-offs. It focusses on small changes to dynamic 
data (CGI output), sending deltas between a base version
that is shared by the client- and server-side “intercepts”
(similar to the client and server proxies described
here). However, WebExpress has apparently not
been used so far for arbitrary W³ pages, nor does it
send stale data optimistically. The limited bandwidth, 
high error rate, and contention in a wireless environ- 
ment suggest that transferring data that may not be used 
could have more negative consequences than over a 
single-user modem. 

Our work also has some similarity to Dingle and 
Partl [5], who proposed that a hierarchy of proxy- 
caching servers could be used to send stale data as a 
MIME multipart document, causing the browser to dis- 
play the stale data immediately and to replace it with 
more recent versions as they become available. The 
main difference with our work is that Dingle and Partl 
do not address the latency of obtaining the final ver- 
sion of the data. Our approach attempts to reduce this 
latency by sending the current version as a delta.
In addition, we do not display the stale page: it is in- 
tercepted by a client proxy [20], typically co-located 
with the browser on the same machine, and passed to 
the browser once it is updated or known to be current. 
In the case where the “stale” page is actually current, 
our system behaves like the proposal in [5]. Finally, in- 
tercepting the stale data by a client-side proxy permits 
the transmission of the stale data to be aborted trans- 
parently once it becomes clear that simply sending the 
current version is more efficient (e.g., if the delta turns
out to be too large relative to the page, or if the current 
page became available very quickly). 

The rest of this paper is organized as follows. Sec- 
tion 2 provides some background into caching in the 
HTTP protocol and an analysis of HTTP latency. Section 3
discusses some data analysis to support our hypotheses
that delta sizes will be small and that there will
be sufficient idle time to transfer stale copies. Next, 
Section 4 describes the design of our system, and Sec- 
tion 5 covers experimental results. Finally, Section 6 
discusses the status of the system and future work, and 
Section 7 offers some conclusions. 


2 Background 


In this section we briefly describe caching in HTTP and 
analyze the dynamics of a typical HTTP transfer. This
description provides background for the discussion in 
the following sections. 


2.1 Caching in the HTTP Protocol 


The HyperText Transfer Protocol (HTTP) [4] supports 
caching in clients (i.e. browsers) and intermediate 
servers known as proxy-caching servers. To display a 
page, a client without a cached copy will unconditionally
send a page request to the content-provider or a
proxy-caching server. A client with the page already
cached may return the cached copy or contact another 
host to determine whether the page has changed. This 
check for page currency is done by sending an HTTP
GET request with a header specifying If-Modified- 
Since followed by the timestamp of the cached copy. 
If a proxy-caching server is used, it can respond to 
a page request or currency check request if it has a 
cached copy that is deemed usable and if the client has 
not specified that the cache must not be used (with a 
Pragma: no-cache¹ directive, most commonly sent
when a user tells a browser to reload a fresh copy of 
a page). 
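
The currency check can be illustrated by building the request text directly. The sketch below is a hedged example rather than code from our system: the function name build_get and its parameters are hypothetical, the request line uses HTTP/1.1 with a Host header for concreteness, and the timestamp is formatted with email.utils.formatdate from the Python standard library.

```python
from email.utils import formatdate


def build_get(path, host, cached_mtime=None, force_reload=False):
    """Return the text of a GET request for `path`.

    cached_mtime: POSIX timestamp of the cached copy, or None if nothing is cached.
    force_reload: True when the user asks for a fresh copy (adds Pragma: no-cache).
    """
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}"]
    if cached_mtime is not None:
        # Ask for the page only if it changed after the cached copy's timestamp.
        lines.append("If-Modified-Since: " + formatdate(cached_mtime, usegmt=True))
    if force_reload:
        # Tell any proxy-caching server not to answer from its cache.
        lines.append("Pragma: no-cache")
    return "\r\n".join(lines) + "\r\n\r\n"
```

A client with no cached copy calls build_get("/index.html", "www.example.com") and gets an unconditional request; passing cached_mtime turns it into the conditional form described above.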

The proxy-caching server decides whether or not the 
cached page is usable and, if so, whether or not the 
client’s cached copy is current, based on the optional 
extra information that the clients can send with their re- 
quests and which HTTP servers can send back with a 
page. Beyond the Pragma: no-cache directive men- 
tioned above, the client may specify a bound on the age 
of the cached copy it is willing to accept (the Cache- 
control: max-age directive). Content-providers can 
specify when the page was last modified, whether or 
not the page can be cached (the Cache-control: no- 
cache field), and how long a client or proxy-caching 
server should cache the page (the Expires field). Dynamic
data, often the output of a CGI script, is typically
sent with no Last-Modified timestamp and set up to 
expire from the cache immediately (equivalent to dis- 
abling caching). 
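
The usability decision made by a proxy-caching server can be summarized in a short function. This is a simplified sketch under stated assumptions, not the HTTP specification's full freshness algorithm: the CacheEntry fields and the usable function are hypothetical, and the age of an entry is approximated by the time since the proxy stored it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    stored_at: float                    # when the proxy stored the page (seconds)
    expires_at: Optional[float] = None  # from the Expires field, if any
    cacheable: bool = True              # False if the provider sent Cache-control: no-cache


def usable(entry, now, no_cache=False, max_age=None):
    """Decide whether the proxy may answer from its cached copy.

    no_cache: the client sent Pragma: no-cache (or the equivalent header).
    max_age:  client bound on the age of an acceptable copy, in seconds.
    """
    if no_cache or not entry.cacheable:
        return False
    if entry.expires_at is not None and now >= entry.expires_at:
        return False
    if max_age is not None and now - entry.stored_at > max_age:
        return False
    return True
```

Dynamic pages that are set to expire immediately have expires_at equal to stored_at in this model, so usable returns False for them at any later time, which matches the effect of disabling caching.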

Currently, if a page has no Last-Modified timestamp,
checking for the freshness of a cached copy
requires retrieving the fresh copy from the content
provider and shipping it all the way to the client
browser. Similarly, changes to a page with a timestamp
will require the proxy to obtain the file from the
content provider and transmit the entire file to the
client; the transmission is elided only if the page has
not been modified at all. 


2.2 Analysis of HTTP Latency 


We now present an analysis of the timing dynamics 
of a typical HTTP transaction. We assume a setting 
where the Web browser (client) talks to a server proxy 
on the other side of a low bandwidth and high latency 
link, which in turn talks to HTTP servers (content-providers)
on the Internet. We use this analysis to make
a case for the usefulness of optimistic deltas. 
Consider Figure 1, which depicts the timeline cor- 
responding to an HTTP transaction between a client, a 


1 HTTP 1.1 alternatively supports a Cache-control: no-cache
directive, which when sent by the client is equivalent to Pragma:
no-cache. In this paper we refer to the pragma to mean either header.

Figure 1: Timeline for typical HTTP request (columns: client, server proxy, content-provider).


server proxy, and a content provider. Packets 1, 2 and
3 correspond to the SYN, SYN/ACK and ACK packets
that are exchanged as part of the TCP 3-way handshake
in the connection establishment phase between
the client and the server proxy. The client sees this
connection as established after a delay of approximately
$2 * L_{slow}$, where $L_{slow}$ is one half of the round-trip
latency of the link between the client and the server
proxy. At this point the client sends packet 4 which
contains the HTTP request. After the server proxy gets
this packet (at approximately time $3 * L_{slow}$), if it needs
to go to the end server to satisfy this request or to validate
its cached copy, it initiates a connection with the
content provider (packets 5, 6 and 7). The latter connection
is established after a further delay of $2 * L_{fast}$,
where $L_{fast}$ is one half of the round-trip latency of the
link between the server proxy and the content provider.
The server proxy then forwards the client’s HTTP request
to the content provider (packet 8).

On receipt of this HTTP request the content provider
does whatever is necessary to produce the response.
This may include several relatively slow activities,
such as forking off a child server and/or passing descriptors
to it, forking off a script to produce the response
document, invoking and waiting for the response
from a search engine, etc. For our purposes we
will view these activities as a certain latency shown as
time $T_{wait}$ in the figure. The magnitude of $T_{wait}$ is
thus a function of the server in question, the particular
URL being served and the load on the server. This
time can be highly variable. Some simple experiments
on the Internet indicate that it varies from a couple of
hundred milliseconds to several seconds. After the response
(packet 9) is available and sent off it takes another
$L = L_{slow} + L_{fast}$ time before the client starts
seeing it. The amount of time $T_{transfer}$ that it takes
to transfer the response is dependent on the size of the
response and speed of the link.

If we put in typical values for the various time
parameters in the figure ($L_{slow}$ = 160ms, $L_{fast}$ =
60ms, $T_{wait}$ varying from 200ms to several seconds,
$T_{transfer}$ for a 2-KByte response at 20 Kbps is
800ms²), we notice that if $T_{wait}$ is large, the low-bandwidth
link is idle for a substantial portion of the time.
Also, regardless of the magnitude of $T_{wait}$ we want to
minimize the amount of data we transfer over the slow
link. These two observations lead to the idea of optimistic
deltas. We want to effectively utilize the slow
link in its idle time, and if possible, reduce the amount
of data transferred, in order to reduce the total latency
perceived by the client. If the client and server proxy
both cache the same older version, only a delta needs to
be sent over the slow link, decreasing $T_{transfer}$. If the
client and server proxy do not cache the same version,
the server proxy can transfer its cached version while
waiting for data from the content-provider and subsequently
send a delta. Here again, $T_{transfer}$ is shorter
if $T_{wait}$ is long enough.
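
The timeline can be turned into a small back-of-the-envelope model. The Python sketch below is one reading of Figure 1, not a measurement: it assumes that the forwarded request (packet 8) needs one further $L_{fast}$ to reach the content provider, that a 2-KByte response is 2000 bytes, and that TCP congestion control is ignored. All function names are illustrative.

```python
def transfer_ms(size_bytes, link_bps):
    """Time to push size_bytes over a link of link_bps, ignoring TCP effects."""
    return size_bytes * 8 * 1000 / link_bps


def first_byte_ms(l_slow, l_fast, t_wait):
    """Delay until the client starts seeing the response.

    3 * l_slow      : client handshake plus the request reaching the proxy
    3 * l_fast      : proxy handshake plus the forwarded request (assumed)
    t_wait          : time the content provider needs to produce a response
    l_fast + l_slow : the response travelling back to the client
    """
    return 3 * l_slow + 3 * l_fast + t_wait + l_fast + l_slow


def slow_link_idle_ms(l_fast, t_wait):
    """Time the server proxy waits, with the slow link idle, for fresh data."""
    return 3 * l_fast + t_wait + l_fast


# Typical values quoted in the text (milliseconds, bytes, bits per second).
t_transfer = transfer_ms(2000, 20_000)          # 800.0 ms
ttfb_fast_site = first_byte_ms(160, 60, 200)
idle_slow_site = slow_link_idle_ms(60, 5000)
```

Under these assumptions, a content provider that takes several seconds to respond leaves the slow link idle for far longer than the 800ms needed to ship a stale 2-KByte copy, which is the window optimistic deltas try to exploit.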


3 Data Analysis 


Simple deltas benefit by trading off computation of the 
deltas for a reduction in bandwidth and latency over the 
slow link when both sides store the same old version 
of a page. Optimistic deltas trade off an increase in the 
amount of data transferred, by sending an older version 
during an idle time of the slow link followed by a delta, 
for a reduction in end-to-end latency. The viability of 


2 This number is very approximate and will in practice be larger
because of TCP’s congestion control algorithms.

either form of deltas is thus dependent on “smallness”
of deltas, which we evaluate in Section 3.1. Optimistic 
deltas depend additionally upon long idle times on the 
slow link, which we consider in Section 3.2. 


3.1 Delta Sizes 


To test the hypothesis that deltas would be sufficiently 
small, we wanted to take a sample of W3 pages and see
how large the differences were between two versions of 
the same page, relative to the page itself. We consid- 
ered two sources of sample pages: the version archive 
of the AT&T Internet Difference Engine (AIDE) [8], 
which stores versions of pages for future visual com- 
parison of their changes, and a set of random URLs 
obtained from AltaVista [1]. The random pages were 
tracked by AIDE as well, but they were considered sep- 
arately from those pages that were actually registered 
explicitly, either by individuals or by inclusion in a list 
of popular URLs collected from a set of bookmark files 
within AT&T. The random URLs and popular URLs 
were archived daily if changes were detected, while 
the pages tracked for individual users were typically 
archived upon explicit request. 

Throughout our experiments, we computed deltas 
using vdelta, a program that generates compact deltas 
by essentially compressing the deltas in the process of 
computing them, and which can be used as a stand-alone
compression program as well [10].³ We must
consider the possibility that W3 pages that are compressed
in a stand-alone fashion will compress so well
that the deltas between two versions of a page are not 
much smaller than the compressed page. In this case 
the client and server proxies could merely compress ev- 
ery page (or rely on compression in the modems) with- 
out using deltas and have the same benefit. We will 
see that in practice, however, deltas are substantially 
smaller than simple compression. 
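
The relationship between deltas and stand-alone compression can be tried out with a few lines of Python. This sketch does not use vdelta; as a stand-in it uses zlib with the old version supplied as a preset dictionary, which approximates the idea of compressing the new version against strings of the old one. The sample pages are synthetic and the function names are illustrative.

```python
import random
import zlib


def compress_alone(new):
    """Stand-alone compression of the new version."""
    return zlib.compress(new, 9)


def make_delta(old, new):
    """Compress `new` with `old` preloaded as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=old)
    return comp.compress(new) + comp.flush()


def apply_delta(old, delta):
    """Recreate the new version from the old version plus the delta."""
    decomp = zlib.decompressobj(zdict=old)
    return decomp.decompress(delta) + decomp.flush()


# Two synthetic versions of a page that differ only in a date stamp.
rng = random.Random(0)
letters = "abcdefghijklmnopqrstuvwxyz"
words = [
    "".join(rng.choice(letters) for _ in range(rng.randint(3, 9)))
    for _ in range(500)
]
body = " ".join(words)
old = f"<html><body>Updated 1997-01-06 {body}</body></html>".encode()
new = f"<html><body>Updated 1997-01-07 {body}</body></html>".encode()

delta = make_delta(old, new)
print(len(new), len(compress_alone(new)), len(delta))
```

When the two versions share most of their content, the dictionary-based output is a small fraction of the stand-alone compressed size; when they share nothing, it degrades to roughly the stand-alone result.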

Considering first the non-random pages, of a total of 
380 pages in the archive, 181 had more than one version,
with a mean of 4.9 versions/page (σ = 10.3).
Figure 2(a) shows a plot of delta size against original 
file size. The delta size is usually a small fraction of 
the original file size. By comparison, Figure 2(b) plots 
delta size against the size of the newer file once com- 
pressed. Figure 2(b) indicates that the delta is consis- 
tently much smaller than the compressed file, though 
in some cases it is approximately the same; this usually 
happens when a file has changed completely from one 
version to the next. Even if the file does not compress 


3 In fact, one can consider a delta of B compared to A as com- 
pressing B with the strings of A already in the compressor’s table of 
prefixes: if B is similar to A then it will compress well, and if not,
it will still compress well if it has internal similarity (as most ASCII
text does).

(a) Deltas compared to original file sizes.
(b) Deltas compared to compressed output sizes.


Figure 2: Comparison of sizes of deltas and original or compressed pages, using vdelta. While deltas have a greater 
benefit when simple compression is not taken into account, they help above and beyond the benefits of compression. 
In each graph, the dashed line indicates the break-even point and the solid line depicts the mean across all files. 


well (for instance, it is a GIF file), the worst that vdelta 
will do is to reproduce the original file with a few bytes 
of overhead. The mean across over 2200 comparisons 
of delta/compressed-file ratios was 19% (σ = 27.6%).


The outlying points in Figure 2, which are due to one
GIF file that has been archived automatically each day
and a compressed PostScript file with two versions in
the archive, might be a concern in practice if the system
were to send stale copies and compute deltas regardless
of file type. Fortunately, file types are identifiable, both 
from the Content-type HTTP header and data within 
the files, so it is possible to treat images and other non- 
textual data specially. One might instead use a distil- 
lation technique to send a version of an image that is 
more appropriate for a low-bandwidth link [11]. 


Our study of 1000 random URLs from AltaVista [1] 
found that 861 URLs were actually accessible at the 
time we started tracking them, and the vast majority 
(79%) of those URLs were not modified in the next two 
months of daily checks. Figure 3 graphs the distribution
of the number of versions detected for the remaining
21% that were modified. Just 43 of the 861 URLs
(5%) had 40 or more versions over the two months of 
the study; the minor variations in the number of ver- 
sions of the frequently-changed pages may be due to 
transient errors while contacting those hosts. We also 
performed the above analysis of delta sizes for these 
pages, and found that the mean delta size was just 3.7% 
of the original page size (σ = 6.9%), and 10.4% of
the compressed page size (σ = 14.0%). The pages
themselves compressed to an average of 12.1% (σ =
18.0%) of their original size. A possible explanation
for the small deltas is that the sample was dominated by
pages that changed daily, and those changes may often
have been the inclusion of timestamps or other small,
simple modifications.

Figure 3: Distribution of the number of versions detected
by daily checks of 861 randomly selected URLs
over a two-month period, for the 21% of pages that
were modified.


Another consideration is what sorts of data can 
be compared. Even dynamic pages, which aren’t 
cacheable, might have a lot of overlap between ver- 
sions of the same page, or pages with the same base 
URL but different parameters to a CGI script. (Deter- 
mining when to compare one URL against a slightly 
different URL for differencing is an open question, but 
as long as both the client and server proxies agree on 
the versions being compared the system will act correctly.)


For example, a query to the AltaVista search en- 
gine [1] might result in a page containing several links 
to content and several more links to other URLs within 
AltaVista. The “boilerplate” can dominate the content 
that changes from page to page, because each page con- 
tains the same form at the top and, at the bottom, a 
set of links to each other page generated by the query. 
Figure 4 graphs the sizes of deltas from two queries, 
compared to the size of the page if it were just compressed.
The first, a name lookup, returned 9 pages; the
second, a query with many terms (“storage manage- 
ment mobile computing flash memory nvram”) that 
generated thousands of matches, returned 20 pages, of 
which 10 were compared. In each case the deltas from 
one page to the next, within a given search result, were 
much smaller even than the compressed pages. 


3.2 HTTP Latency


To get a sense for the likelihood that a request would 
take a long time to start receiving data, we collected 
1000 random URLs from AltaVista [1] and timed their 
responses. This study differs somewhat from Viles and 
French [21], who studied the availability of random 
HTTP servers and the time to connect to them; here we 
are seeing how long it takes to collect the first data from
a W3 page. Figure 5 shows the results of this experi- 
ment, based on the 722 URLs that returned data within 
the first minute. We found that about a third of pages 
responded within a second, assuming they responded 
at all, and half responded within about 1.6s. However, 
it takes 5s to cover “ of the pages and 10% took 10- 
30s or more for the first data to arrive. As more pages 
on the W3 are dynamically generated, we expect the
fraction of pages with sluggish response to increase.

Figure 4: Example of deltas for two AltaVista queries,
one for a person and one a mixture of computer science 
terms. All comparisons were pairwise in sequential or- 
der, starting with a delta between the first two pages of 
a query result. The URL of each page varied slightly 
because it specified the range of responses to return (1-10,
11-20, etc.).

Figure 5: Distribution of response times to receive first
data, based on a sampling of 722 URLs. Roughly 5 
of the sample received data within a second, and 3 re- 
ceived data within 5 seconds. 

4 System Design


4.1 Design Considerations 


Once we were confident that sending deltas could often 
reduce bandwidth requirements and/or client-observed 
latency of Web access, we had to consider two issues. 
One was what the architecture of a delta-based system
would be. We rejected any scheme that would require
changes to HTTP or to existing content providers,
though we later learned that HTTP/1.1 will support
a PATCH directive to allow efficient uploading of 
changes to shared documents. Instead, we settled on 
a proxy-based architecture in which the browser con- 
nects to a proxy on the same machine (the client proxy), 
and that proxy in turn connects to a server proxy on 
a well-connected host. We have control over both of 
these proxies; in fact we use a common source code 
base for them, though that is not necessary. The design 
of the proxies is covered in greater detail in Section 4.2. 

The second open issue was how the deltas would 
help when the user’s machine does not store an old ver- 
sion of the page. If the well-connected proxy has many 
clients, or can talk to a nearby proxy that does, it may 
have fast access to a cached copy of the requested page.
When that page is current, nothing need be done other 
than sending the cached copy over the phone line. If it 
is stale, however, there might be an opportunity to use 
deltas after sending the stale data. 

In fact, one might expect the server proxy to handle 
a number of clients and to keep multiple versions of 
each document in its cache. Sometimes a client’s re- 
quest will specify a version which the caching proxy 
has, and then only a delta needs to go back over the 
slow link. Other times, the server proxy will not have 
the same version cached as advertised by the client in 
which case it will try to utilize the $T_{wait}$ time by sending
the stale copy it has and subsequently sending
the delta. Since in the latter scenario the benefit is potentially
lower, especially when $T_{wait}$ is not very large,
we may bias the system so that it hits the first scenario 
more often than the second. This may be brought about 
by auxiliary mechanisms like prefetching during idle 
times to keep the client and server proxies’ caches in 
sync. 

In total, there are numerous ways in which the opti- 
mistic delta mechanism can improve the efficiency of 
W3 access:


• The client and server proxies may share an out-of-date
version. Consider the case when the client
proxy sends an If-modified-since request with a
No-cache directive to the proxy, which caches the
same version, and assume that the page has been
modified on the content provider site. In this case,
the proxy obtains the page, computes the delta
and, if it is smaller than the whole page, sends
the delta instead of the whole page to the client.
Thus, the demand for bandwidth of the (slow) link
to the client is decreased. Again, we refer to this
case as a “simple delta,” which is less “optimistic”
than others: it only relies on the delta being small
enough to be beneficial but does not risk transferring
useless data.


• The server proxy may have the current version,
but the client proxy wants to check the validity
of its own cached copy with the content provider.
Assume the client sends an If-modified-since request
with No-cache to the proxy server, which
caches a newer version that is the same as the version
on the content provider site. In this case,
the proxy immediately sends its copy to the client
(marked as Stale); in parallel, it sends an If-modified-since
request to the content provider,
verifies that its copy is actually current and sends
a null delta to the client. The browser can display
the page as soon as the conditional GET returns
via the server proxy, rather than having the newer
contents of the page transferred starting then.


The same latency reduction applies if the client 
has no cached copy but requests the most current 
version of a page, since the server proxy can send 
a Stale copy and then confirm that the copy is current
after its own If-modified-since request.


• The server proxy may have a newer version than
the client, as well as the client’s cached version. 
Assume the client proxy asks the server proxy for 
a page that the client has cached, and the server’s 
copy is more recent but is not necessarily the most 
current version. The server can respond with a 
delta against the client’s version. If the page is 
out of date or the client specifies that the cache be 
bypassed, the content provider is consulted and a 
second delta can be sent if needed. 


• The client and server proxies may share a current
version of an “uncacheable” page, one that must
be retrieved directly from the content provider on
each access. Our system permits the client and
server proxies to cache such pages as a basis for
deltas between versions of a page, while ensuring
correctness by providing the browser with the
most current version every time. The server proxy
can determine that the page is unchanged (using
a regular GET over the high-speed network and
comparing the contents with the cached version)
and notify the client proxy to use the cached version
rather than transferring the page over the low-speed
network. Note that while other systems do
this for cacheable pages, ours does this even for
uncacheable ones as far as the slow link is concerned,
while taking advantage of deltas when the
differences are small.


4.2 Architecture 


Figure 6 shows the architecture of our system. The 
browser connects to a local (client) proxy using the 
usual HTTP proxy-caching mechanism, by sending it 
requests containing the full URL of a desired page. We 
assume that the browser does not cache pages, but in- 
stead relies on the client proxy; however, this does not 
affect correctness, only storage utilization and caching 
effectiveness. The client proxy serves the request out 
of its cache if possible, or forwards it to the server 
proxy on the other side of the slow connection. The 
added overhead of a client proxy on the same machine 
is minimal by comparison to network delays and so the 
analysis in Section 2.2 still holds. 

The server proxy can respond using its cache if the 
request is for a cached page, the page is not out of date 
with respect to its expiration date (if any), and the client 
has not issued the no-cache pragma. Otherwise, it forwards
the request upstream, either directly to the content
provider or to yet another proxy-caching server.
In either case, if the client and server proxies share a 
cached version, a delta from that version to the current 
version can be sent once the current version is avail- 
able. If they do not share a cached version but the 
server proxy has some version cached, the server can 
send the possibly stale version followed by a delta from 
that version to the current version. 
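
The choices described in the last two paragraphs can be summarized as a small decision function. The Python sketch below is one reading of that prose, not the proxy's actual code: the function name, its parameters, and the wording of the returned steps are hypothetical, and versions are represented as integers for simplicity.

```python
def plan_response(client_version, cached_versions, cache_usable, no_cache):
    """Return the ordered steps a server proxy takes for one request.

    client_version  -- version the client proxy reports having, or None
    cached_versions -- set of versions held by the server proxy
    cache_usable    -- True if the newest cached copy has not expired
    no_cache        -- True if the client sent the no-cache pragma
    """
    shared = client_version is not None and client_version in cached_versions

    # Case 1: the request can be answered from the server proxy's cache.
    if cached_versions and cache_usable and not no_cache:
        newest = max(cached_versions)
        if client_version == newest:
            return ["tell client its copy is current"]
        if shared:
            return [f"send delta {client_version}->{newest}"]
        return [f"send cached version {newest}"]

    # Case 2: the content provider (or an upstream proxy) must be consulted.
    steps = []
    if cached_versions and not shared:
        # Optimistic transfer: use the idle slow link while waiting.
        steps.append(f"send possibly stale version {max(cached_versions)}")
    steps.append("fetch current version upstream")
    if cached_versions:
        base = client_version if shared else max(cached_versions)
        steps.append(f"send delta {base}->current")
    else:
        steps.append("send full current version")
    return steps
```

For example, a client with nothing cached that insists on a fresh copy, talking to a proxy that holds version 5, yields a stale transfer of version 5, an upstream fetch, and then a delta from version 5 to the current page.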

The server proxy determines what the client proxy 
has cached via some extra headers in the HTTP request. 
First, an Accepts: multipart/attdelta field indicates 
that the client understands the delta format. This way, 
a browser or other unmodified client can talk to our 
proxy without getting back something it cannot interpret.
Second, a Current-version: [signature] field informs
the server which version the client has cached, if
any. The signature can in principle be anything that can
distinguish different versions of a document, such as an
MD5 checksum. We make the simplifying assumption
that any client proxy that requests a new version from
a server proxy will have received the previous version
from the same proxy, and we use a monotonically in- 
creasing version number (instead of a checksum) that 
the server generates and passes to its clients. 
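
A minimal Python sketch of how the two extra request headers might be produced and interpreted follows. The header names and the multipart/attdelta value are taken from the text; the function names and the dictionary representation of headers are assumptions made for illustration.

```python
def delta_request_headers(cached_version=None):
    """Extra headers a delta-aware client proxy adds to its request."""
    headers = {"Accepts": "multipart/attdelta"}
    if cached_version is not None:
        # Monotonically increasing version number issued by the server proxy.
        headers["Current-version"] = str(cached_version)
    return headers


def client_state(headers):
    """Server-proxy side: return (understands_deltas, cached_version or None)."""
    understands = headers.get("Accepts") == "multipart/attdelta"
    version = headers.get("Current-version")
    if not understands or version is None:
        return understands, None
    return understands, int(version)
```

An unmodified browser sends neither header, so `client_state({})` reports that deltas are not understood and the proxy falls back to ordinary responses.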

Table 1 summarizes the possible combinations of 
client/server proxy states and the procedures that are 
followed.


4.3 Detecting Non-cost-effective Transfers


As was mentioned in the introduction, one difference 
between our system and the proposal of Dingle and 
Partl [5] is the ability to abort the transfer of stale data.
In fact, there are two cases when it is more appropriate 
to behave like standard proxies and just send the cur- 
rent version of a page to the client without delta pro- 
cessing. 

In the “simple delta” case, the client and server prox- 
ies cache the same stale version of the page, and only 
the delta need be sent. If the delta is as large as the current
page, the current page rather than the delta must be
sent. If the delta is somewhat smaller, one could use 
heuristics to decide whether the cost of recreating the 
current version from the delta exceeds the benefits due 
to bandwidth reduction. We currently transfer the delta 
any time it is smaller than the original. 

The other case occurs when an optimistic transfer is 
in progress and the new version of the page starts to ar- 
rive at the server proxy. If most of the stale version has 
been sent and the delta is small, it pays to finish send- 
ing the stale version; if there is a lot of data yet to be 
transferred and/or the delta is large, the transfer should 
be aborted and the current version should be sent as it 
becomes available. 

If the cost of recreating the page from the stale ver- 
sion and the delta is negligible, and we assume that the 
two versions are the same size, then we should continue 
to send the stale version any time the remaining stale 
data plus the size of the delta is less than the size of the 
current version. (In practice, we will know the size of 
the current version if the Content-length header field 
is present.) Assume the length is $L$ and the size of the
delta is $\delta L$. If we have already transferred $\delta L$ bytes of
the stale version, then transferring the remainder plus
the delta will require $(L - \delta L) + \delta L = L$ bytes, no more
than sending all $L$ bytes of the current version.
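
The break-even rule can be written as a one-line test. The Python sketch below assumes, as the text does, that recreating the page from the stale copy and the delta costs nothing; the function and parameter names are illustrative.

```python
def keep_sending_stale(stale_len, stale_sent, delta_len, current_len):
    """True if finishing the stale transfer and then sending the delta
    moves fewer bytes over the slow link than sending the current page."""
    remaining = stale_len - stale_sent
    return remaining + delta_len < current_len


# Both versions are 10000 bytes and the delta is 2000 bytes.
at_break_even = keep_sending_stale(10_000, 2_000, 2_000, 10_000)
past_break_even = keep_sending_stale(10_000, 2_001, 2_000, 10_000)
```

With equal-sized versions, the rule starts to favor finishing the stale transfer once more than a delta's worth of stale bytes has already been sent.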

Since we have no way of knowing how large a par- 
ticular delta will be, any scheme that depends on com- 
puting the delta only after the whole response has been 
received by the server proxy can sometimes perform 
badly. However, there is an entire family of increas- 
ingly sophisticated abort schemes that one can think of, 
which can be integrated into the process of receiving 
the current version, producing the delta on the fly, and 
aborting if the delta appears large. 


Figure 6: System architecture.

Table 1: Possible states when requesting a URL. “Cached copy current?” refers to whether the server proxy can respond
using its cached copy without consulting the content provider.


5 Experiments


Ideally, we would like to perform a long-running experiment
that would compare end-to-end performance
of our optimistic delta mechanism with existing proxy
caching. One way to perform such measurements
would be to get a set of random URLs, and then re- 
quest each URL periodically for an extended time us- 
ing the optimistic delta approach and the existing ap- 
proach, and compare the average response time. We 
plan to perform such experiments in the future. 

To date, we have focused on “microbenchmarks” to
study the two extremes of interest: the case when the 
client does not have a page cached, and must obtain a 
full copy of the page (possibly a stale copy followed by 
a delta or confirmation that it is current); and the case
when the client and server have the same copy cached 
and the server can send a delta. We compare each of 
these against the response time of an unmodified sys- 
tem that connects directly to the server proxy and does 
not use deltas. 


5.1 Experimental Setup 


We performed our tests using a pair of Intel-based sys- 
tems running the proxy code and a Sun SparcStation 
providing content. Specifically, the client system was a 
Pentium 133 MHz based machine with 32MB of RAM,
running BSDI’s BSD/OS v2.1. (Our proxy code is 
very portable across Unix platforms and currently com- 
piles without any change on SunOS, Solaris, Linux, 
FreeBSD and BSD/OS. Its main system dependencies 
are BSD sockets and vdelta so we expect it should 
be easily portable to any system that has BSD sock- 
ets and some kind of difference library that supports 
binary files.) It was connected using an AT&T Para- 
dyne Comsphere 3820Plus modem at 28.8 Kbps to a 
dial-in server on the AT&T corporate network. The 
server proxy ran on an identical machine in the AT&T 
network one hop away from the dial-in server. The 
server connected to an HTTP daemon (hrd, an internally
developed server) on a uniprocessor SparcStation
20 on the same Ethernet segment as the server
proxy. This provided relatively fine-grained control 
over the latency between the server proxy and the con- 
tent provider. 

The “browser” was a simple C program that fetched 
a series of URLs specified in a control file by communicating
with the client proxy, which also ran on the client
machine. The “browser” did not do any caching.


5.2 Test Data 


Here we report a performance evaluation using a syn- 
thetic workload based on the multi-version archive of 
W3 pages collected by the AT&T Internet Difference
Engine (described above in Section 3.1). This archive 
reflected the actual evolution of the pages, although it 
did not contain copies of every version of every page:
some pages were archived automatically once per day
when changes were detected, while the majority were 
archived upon the explicit instruction of a user of the 
system. 

Slightly over half of the pages had only one ver- 
sion archived; these reflected pages that were regis- 
tered with the system but had either never changed or 
(more likely) were not archived automatically and had 
not been selected for subsequent archival by a user. We
excluded these pages from the benchmark because no 
deltas were available. On the other hand, about 10% 
of the 380 pages had 10 or more versions archived, 
and several had 50 or more versions (the latter were all 
pages that were archived automatically). 


5.3 Benchmark


The purpose of the benchmark was to examine the ef- 
fect of several parameters on end-to-end latency in the 
optimistic delta system: delta size, server latency, and 
cache contents. We considered delta size by retrieving 
many pages with different characteristics. We exam- 
ined the effect of server latency by varying the response 
time prior to sending data to the server proxy (see be- 
low). Finally, we evaluated the difference between 
sending deltas to a client with the past version of the 
page cached and one without it cached. (In the case of 
the unmodified proxy, without deltas, caching was ir- 
relevant because each version of the page was retrieved 
exactly once.) All requests were made with Pragma: 
no-cache, and the pages always differed upon each re- 
trieval. Other cases, such as when the page has not 
changed or the server proxy can return its cached copy,
are relatively uninteresting: they either favor the opti- 
mistic approach or are equivalent between the two sys- 
tems. 

In each run of the benchmark, the client system re- 
trieved each URL repeatedly, once for each version 
that existed. A CGI script mapped the URL into a 
local filename that is dependent on the next version 
for that URL: the first GET on /deltatest/pageN returned
/deltatest/pageN/1, the second returned /2,
and so on. Thus to the “browser” (actually the bench- 
mark program) and the proxies, the same URL mapped 
to new versions of the page upon each request. 
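
The URL-to-version mapping can be sketched in a few lines of Python. This is an illustration of the idea rather than the script used in the benchmark: the class name is hypothetical, and the counters are kept in memory, whereas a real CGI script runs as a short-lived process and would have to persist them (for example in a file).

```python
from collections import defaultdict


class VersionCycler:
    """Map each request for a URL path to the next archived version."""

    def __init__(self):
        # Number of times each URL path has been requested so far.
        self._served = defaultdict(int)

    def next_path(self, url_path):
        self._served[url_path] += 1
        return f"{url_path}/{self._served[url_path]}"


cycler = VersionCycler()
first = cycler.next_path("/deltatest/page7")
second = cycler.next_path("/deltatest/page7")
```

Because the mapping happens behind the URL, neither the benchmark program nor the proxies need any knowledge of the version archive.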

In addition, the CGI script read a file to determine 
how much of a delay it should insert before respond- 
ing, which was used to simulate delay in the Inter- 
net and/or on the content provider. Longer delays per- 
mit more of an opportunity for optimistic transfers of 
large documents but also place a larger lower bound 
on end-to-end latency: if a server takes a minute to re- 
spond, then it will be at least a minute before the client 
proxy knows it has the current version of the page. We
therefore expected that the runs with no added latency
would show a high benefit from cached deltas and suffer
delays from ill-placed optimistic transfers of the old
copy that could not fit into the $T_{wait}$ idle period, while
runs with significant added latency would favor the optimistic
approach but gain less from sending cached
deltas (the amount of time saved as a fraction of total
latency would decrease due to the fixed overhead).

Figure 7: Average uncompressed data size across all
versions of each page, sorted by size, shown on a log
scale.

For the current set of experiments, we were forced to 
restrict the range of parameters to keep the experiments 
tractable in a limited time frame. We did this in two 
respects: 


• We ran experiments using fixed delays of 0s and
5s, to show extreme cases: what happens when
data is nearly immediately available, and what
happens when old data can be transferred during
idle times.

• We restricted the maximum number of versions
for a given page to 10, under the assumption that
the behavior for the first 10 versions of a page
would be representative of the entire set.


5.4 Results 


To make it easier to interpret the results, we sorted 
URLs by the average uncompressed file size. Figure 7 
graphs the average size (across all versions of a page) 
as a function of the sorted URL numbers, which are 
used in the other graphs below. 

Figure 8 shows our results. Each graph plots the average
ratio of end-to-end latency using the modified
system to latency using the unmodified system⁴. The
left column shows cases where the client proxy caches 
the previous version (simple deltas), while the right 
column shows the use of optimistic deltas. The first 
row shows no added content provider latency, and the 
second row shows 5s of added latency. The URLs are 
sorted in the same sequence as in Figure 7. The solid 
line in each graph indicates the mean of all the points 
in the graph, while the dashed line indicates the break- 
even point. 

The cost of computing deltas and patching was neg- 
ligible (1-2%) compared to the network transfer time 
and protocol processing overhead in all our experi- 
ments. Moreover, the largest measured value of over- 
head from computing a delta and applying the delta on 
the client was much less than the typical variation in the 
total URL fetch times. 

From Figure 8 we draw the following conclusions: 


• The pages with the lowest index, which have the
  largest original file size, tended to show more
  improvement than the smaller pages, but although
  the general trend is upward as one moves right
  along the X-axis, there are great variations from
  page to page.


• As expected, without added latency, many of the
  pages took longer using the optimistic approach
  than without it. The measurements in Figure 8(b)
  were taken with a simple abort strategy in place.
  This strategy aborted only when the server proxy
  had finished computing the delta and the amount
  of remaining stale data plus the delta was more
  than the size of the regular response. We expect
  that a smarter abort strategy, such as aborting
  an ongoing optimistic transfer of stale data and
  “cutting through” new data even as it is being
  received if it appears that it is very different from
  the cached stale data, would cap the latency of our
  system at close to 100% of the unmodified system.


• With 5s added latency, most pages were received
  faster by the client using optimistic deltas, with a
  mean improvement of 27%. In fact, the latency
  for optimistic deltas with 5s added delay was
  consistently somewhat less than that for simple deltas
  with the same delay. We attribute the better
  performance of optimistic deltas to TCP’s slow-start
  algorithm [14]. In the case of the optimistic deltas
  the transfer of the stale data opened up the TCP
  congestion window, so the deltas were transferred
  faster in this case than in the case of simple deltas,
  where the transfer of the delta had to open up the
  congestion window itself.

⁴ In the unmodified system, there were no proxies involved and
the client talked directly to the content-provider.

Figure 8: Experimental results, showing ratios of end-to-end latency for modified versus unmodified system, varying
whether old versions are cached on the client or sent optimistically by the server, and whether the content provider
adds 0s or 5s of latency before returning content. Each data point represents the average across all versions for the
corresponding page. The solid line in each graph indicates the mean of all the points in the graph, while the dashed line
indicates the break-even point. The “simple delta” case never experienced aborts, while the “optimistic delta” case
experienced aborts that are indicated with a different symbol.


• Nearly all of the simple deltas improved performance
  regardless of added latency, which one
  would expect. As predicted, the relative gain
  was generally better when the fixed overhead was
  lower. In the case of 0s added latency, most of the
  points that showed degradation were cases of only
  two versions being available (hence a greater
  likelihood of variability due to external factors) and
  where the deltas were 40-60% of the original file
  size. The overall improvement was 33%, which
  was the best of the four configurations.
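
The simple abort strategy described in the conclusions above reduces to one comparison. The following Python sketch is illustrative only: the function and parameter names are hypothetical, and it assumes all sizes are measured in bytes.

```python
def should_abort_optimistic_transfer(delta_ready, stale_bytes_remaining,
                                     delta_bytes, full_response_bytes):
    """Return True if the server proxy should stop sending the stale copy.

    Mirrors the simple strategy in the text: abort only once the delta has
    been computed, and only if finishing the stale transfer and then sending
    the delta would move more data than the regular response.
    """
    if not delta_ready:
        # The delta size is unknown until the new version has fully arrived.
        return False
    return stale_bytes_remaining + delta_bytes > full_response_bytes
```

Because the check is made only after the delta exists, this rule cannot cut a transfer short while the new version is still arriving, which is the limitation the smarter strategy is meant to remove.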


6 Status and Future Work 


At present, all of the functionality described in this 
paper has been implemented except for the ability of 
the server-side proxy to store multiple versions of the 
same URL. We plan to implement these features and 
test the system in a multi-user environment, where the 
same server proxy handles requests for multiple users. 
We believe that in this environment there will be many 
more cases where the server proxy has content that 
a particular client proxy does not, resulting in more 
optimistic transfers than would occur in a single-user 
context. One must evaluate policies for determining 
how many versions to keep and how many concurrent 
clients can be supported by a server proxy. 

We plan to expand the URL comparison logic to handle
the case of variants on the same URL (including the
part of a GET URL that specifies CGI parameters), as
described above in Section 3.1. In fact, it might be pos- 
sible to hash the contents of pages to find other pages 
that are substantially similar and would generate small 
deltas. 

Currently, each communication between the 
browser and the client proxy, or between the client 
and server proxies, requires a TCP setup. Persistent 
HTTP [17] should improve performance further, but 
we have not yet implemented a persistent connection 
in our proxy. 

In the current system, deltas are generated only when 
the current version of a document has been received in 
its entirety. We intend to add incremental delta genera- 
tion so that the delta can be sent over the slower link as 
content is received, and so that it is possible to abort
optimistic transfers early if the delta appears to be large.
It is also possible to use historical data to estimate the
usefulness of sending stale data: if T_idle for a particular
host or page is usually very small, then one might
not bother with the optimistic transfer. 


Finally, it should be useful to integrate prefetch- 
ing into the optimistic delta system. In addition to 
prefetching new pages through the server proxy to the 
client (similar to the studies mentioned above [18, 22]), 
we can prefetch deltas to keep the proxies’ caches bet- 
ter synchronized. 


7 Conclusion 


We have proposed an optimistic deltas approach to reduce
the latency of accessing W³ pages. This approach
involves sending the differences between versions of a 
page, or deltas, to the client, instead of sending entire 
pages. It also permits stale data to be sent during pe- 
riods of inactivity. Our approach is optimistic because 
it sends data that may not be needed; instead, it opti- 
mizes for the common case when pages change incre- 
mentally, at the expense of a slight overhead in the rare 
cases when a modification drastically changes the content
of the page. In other words, we assume that in most
cases when a copy cached by the proxy is deemed unusable,
it is either still current, or, if it has been modified,
the size of the modification is considerably smaller than 
the page itself. 

Our study of an AT&T multi-version archive of W³
pages confirmed the above assumption. In fact, by ex- 
amining the extent to which the results of AltaVista 
queries with slightly different parameters differ, we 
showed that this assumption may even hold for dynam- 
ically generated pages. However, in general we expect 
that other sorts of data, such as images, should be han- 
dled specially rather than processed as deltas. 

A study of the latency to obtain W³ pages confirmed
that the latency in obtaining data may often be suffi- 
cient to send stale data, for the purpose of sending a 
small delta once the data is available. However, perfor- 
mance may be degraded when latency is low and more 
sophisticated techniques for deciding when to abort the 
transfer of stale data are required. 

We implemented our approach without changing the 
browser. Instead, we configure the browser to connect 
to a client proxy on the same machine, which in turn
connects to a server proxy. These proxies have been 
modified to follow the optimistic deltas approach. We 
compared the performance of this configuration with 
the original system. This performance study, based on 
microbenchmarks, showed a significant latency reduc- 
tion achieved by our approach: an average of 12-33% 
improvement across all pages in the study, depending 
on system parameters, with some transfers improved 
by an order of magnitude. One particularly surprising 
result was the effect that transferring potentially stale 
data had on the TCP slow-start algorithm when a link
is otherwise idle, consistently improving end-to-end
latency.

While a long-term experiment that would compare 
the performance of our approach with existing proxy 
caching systems on real-life workloads is needed, the 
experiments described in the paper strongly suggest 
that the optimistic delta mechanism results in a considerable
reduction of W³ latency.


Abstract


Partially connected operation is the circumstance 
in which the communication link between two 
computers is intermittent, either by choice or because
of failure. This paper describes the design
and performance of a user level toolkit that is
suited to accessing the home directory of a partially
connected user in a bandwidth-efficient manner.

Compared to custom file systems for partially 
connected operation, the toolkit is easier to deploy 
and provides extra degrees of flexibility because it 
runs at user level and uses the local file system for 
its cache. The toolkit makes it possible for unal- 
tered clients to access NFS file systems exported 
by unaltered servers. 

The maintenance of consistency between client 
and server is automatic provided that certain as- 
sumptions are upheld, the primary one being that 
sharing is limited in such a way that there is a 
“single locus of update.” That is, for extended 
periods, updates are applied either to the client’s 
cache or directly to the server, but not to both 
simultaneously. This pattern of use is typical of a 
user’s home directory. 

Performance is the disadvantage of user level 
operation. The client’s cache is managed by a 
“caching tool” that services every file system op- 
eration, and this redirection increases latency sub- 
stantially. 


1 Introduction 


This paper explains a set of simple user level tools 
that is suited to accessing one’s home directory 
over an intermittent connection. The toolkit implements
“partially connected operation” for standard
file system protocols (e.g., NFS version 2)
running on unmodified UNIX hosts. The work 
was motivated by the following personal experi- 
ence, which provides an example of partially con- 
nected operation. 

My home computer has a leased line connection 
to an Internet service provider (ISP); my company 
uses a different ISP. In order to avoid being a system
administrator, I initially used my home computer
as an X terminal, telnet-ing to the machines
at my office. Unfortunately, network response
is highly variable (waiting seconds or minutes
for keystroke echo is not unusual), and outages
are frequent (e.g., several times per evening) and
sometimes long (e.g., an entire holiday weekend),
enough so that interactive operation over the
Internet proved impractical.

I considered two alternatives. One was to con- 
tinue using the home computer as a terminal, di- 
aling up to the office. The other was to make the 
home computer as self-contained as possible. My 
home is located in a different area code from my 
office, so dialup accrues long distance charges both 
per-call and per-minute; therefore, I chose the sec- 
ond alternative. 

I implemented a “toolkit” of user-level programs 
that together form what might be viewed as a 
poor man’s version of a distributed file system. 
The system minimizes the connect time between 
the client at home and the server at work, hence 
“partially connected operation.” Compared to the 
pioneering work on file systems supporting disconnected
and/or partially connected operation
(e.g., [9, 14, 6, 8]), what I have implemented is
unambitious and not technically groundbreaking.
However, over time it became clear that operating
at user level (where tools such as file system
utilities, scripting languages, etc. are available)
brings some profound advantages. My toolkit approach
stands in contrast to the approach based
on custom kernels and file system protocols. My
claim is that a simple user-level toolkit can handle
the most common workload (i.e., files drawn from
a personal home directory, with updates coming
from only one side of a connection at any time)
with considerably less mechanism, more flexibility,
and much greater ease of deployment.

¹ The primary cause of the problem is that home and
office have different ISPs. The two ISPs connect at the
“MAE West” router on the west coast, so packets between
New York City and northern New Jersey (four miles as the
crow flies) travel across country and back again, typically
going through 20 hops.

In my approach, the home computer is a file 
system client that is disconnected from the servers 
at work except when a phone call is made between 
them. The derived technical problem is how to 
keep the home and office copies consistent while 
minimizing phone charges. The toolkit is used as 
follows. 


1. The office file server (master) exports my
home directory to the home computer (slave)
as a separate NFS file system.²


2. The slave caches a subset of my home di- 
rectory and also has a normal collection of 
slowly changing system files. Cached files are 
kept not in a hidden area, but in the local file 
system, in the same hierarchy, with the same 
names, as at the master. When I am at home,
applications execute on the slave, accessing 
files using the same names as if the applica- 
tion were running at the office, but the files 
come from the cache. 


3. The slave fetches files and directories on de- 
mand, using the NFS protocol (version 2); 
however, it does not mount the master’s ex- 
ported file system in the usual way. Instead, 
the master file system is mounted so that a 
local process is its NFS server. Whenever the 
slave accesses a file with a name that is in 
my home directory, the process (called the
“caching tool”) receives control. When possible,
it redirects the access to the local cached
file with the same name. If the relevant file
is not yet cached, the process fetches it and
copies it to the local cache.


The caching tool is a modified version of Amd
[17] (version “upl 102”), a user level automounter.
Partially connected operation is implemented
as a new file system type (named
“cfs”). A UNIX client need only upgrade to
the new version of Amd to get partially connected
access to home directories as well as all
other services currently supported by Amd.

² At my site, this required some administrative changes.
Ordinarily, several home directories are bundled into a single
file system.


4. When I move between office and home, the 
file hierarchy at the new location will be out of 
date with respect to the old location if I made 
changes on the old system. To determine the 
changes to propagate from the old system to 
the new, I run on each system a “checksum 
tool” that computes cryptographic checksums 
for the file hierarchy on that system. 


The checksum of a directory is computed 
over the names and checksums of the direc- 
tory entries. If two corresponding directories 
have the same checksum, the entire hierar- 
chies rooted at those directories are identical. 


5. The checksum tool produces a report that is 
fed into a “comparison tool.” This tool uses 
the report plus knowledge of which site has 
the most recent changes to decide which files 
and directories to copy to or delete from the 
site that is out of date. 


The comparison tool achieves massive prun- 
ing from the fact that equality of two direc- 
tory checksums implies equality of the entire 
hierarchies rooted at those directories. 


This approach does not constitute a traditional 
distributed file system. Rather, it is a toolkit ap- 
propriate only in the case where there is an eas- 
ily identified “single locus of update”—i.e., for ex- 
tended periods, updates are applied either to the 
home file system or to the office server, but not to 
both. The checksum and comparison tools are in- 
voked whenever the locus of updates changes, and 
together they reestablish consistency. Neither file 
system should accept updates while the checksum 
and comparison tools are running. Since home sys- 
tems and portable computers are intimately per- 
sonal, the limitation might not be as severe as it 
may seem at first. Indeed, experience with the 
Ficus file system in a similar setting of a home 
workstation suggests that there is only a single 
locus of updates—that “the user acts as a write 
token” [4, 18]. 

While the single locus of update restriction 
significantly reduces the technical problem ad- 
dressed, the toolkit nevertheless is interesting for 
a few practical reasons. First, the one case it handles
(a single easily identified locus of updates)
is a common, and perhaps the dominant, read-write
file system workload. Further, no kernel
changes are required at either client or server, 
and only a few configuration changes are needed. 
The major ones are, first, to have the server ex- 
port the home directory as a separate file system, 
and, second, to make the caching tool service the 
mount point at the client. All the tools run at 
user level; only the caching tool must run with 
root privilege. A consequence of the easy deploy- 
ment is that for the first time standard file systems 
(e.g., NFS) are available to ordinary, partially con- 
nected UNIX clients. Finally, this work is highly 
portable. The checksum and comparison tools are 
“ordinary UNIX programs” that need no special 
privilege. The most complicated component, the 
caching tool, is a version of Amd, which is one of 
the most widely ported UNIX programs in exis- 
tence. The caching tool inherits this high degree 
of portability. Perhaps ironically, the tool that is 
by far the hardest to port is the dialup software. 
The next section explains the operation of the 
tools. The section after that presents performance 
analysis. Section 4 discusses related work. 


2 Tools 


The next few sections explain the design and op- 
eration of the tools for checksum computation, file 
hierarchy comparison, caching, and dialup. 


2.1 Checksum Tool 


The checksum tool writes a file, named
“.contents”, in each directory of a hierarchy.
Such a file contains the type of checksum algo- 
rithm used and the names and checksums of each 
file, directory, or symlink in the directory as of the 
time the tool was invoked. The name/checksum 
entries are sorted by name. Other objects that 
can populate directories (like devices and sockets) 
are ignored. Any pre-existing .contents file is 
also ignored. 

The checksum of a symbolic link is computed 
by not following the link but rather by checksum- 
ming the contents of the link. Not following sym- 
links converts a file system hierarchy from a gen- 
eral graph into a DAG. The DAG is then treated 
as a tree; i.e., a file with multiple links appears 
several times in the tree, once in each directory. 
The checksum of a directory is the checksum of 
its .contents file; therefore, checksumming is a 
bottom-up operation.

As an example, suppose a directory contains two
files w, x, a symlink y, and a directory z, with
checksum values 1, 2, 3, and 4, respectively. The
following text would be checksummed and the resulting
value considered to be the checksum of the
directory:


    algo MD5
    w 1
    x 2
    y@ 3
    z/ 4


In this case, 3 is the checksum value of the sym- 
link stored in y. The “@” and “/” symbols denote 
symlinks and directories, respectively. The com- 
parison tool uses this type information, as it may 
need to treat symlinks and directories differently. 

The available checksum algorithms are MD4 
[19], MD5 [20], and UNIX “sum.” The checksum 
algorithm is a modular component of the tool, and 
it is trivial to add new choices. 

An entry that is unreadable is considered not to 
exist and is not listed in .contents. If a directory 
is unwritable, then its checksum field is left blank 
in the .contents file in the directory above. 
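
The bottom-up computation described in this section can be sketched in Python as follows. This is a minimal illustration and not the tool itself. The function names are hypothetical, and the single-space separator is an assumption based on the example above. The sketch returns the text rather than writing .contents files, and it omits the blank checksum field for unwritable directories.

```python
import hashlib
import os
import stat


def file_digest(path):
    """MD5 of a regular file's bytes, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def contents_text(directory, sums):
    """Build the .contents text for one directory, recursing bottom-up."""
    lines = ["algo MD5"]
    for name in sorted(os.listdir(directory)):   # entries are sorted by name
        if name == ".contents":
            continue                             # pre-existing file is ignored
        path = os.path.join(directory, name)
        try:
            mode = os.lstat(path).st_mode        # lstat: do not follow symlinks
            if stat.S_ISLNK(mode):
                target = os.fsencode(os.readlink(path))
                lines.append(f"{name}@ {hashlib.md5(target).hexdigest()}")
            elif stat.S_ISDIR(mode):
                lines.append(f"{name}/ {directory_checksum(path, sums)}")
            elif stat.S_ISREG(mode):
                lines.append(f"{name} {file_digest(path)}")
            # devices, sockets, and other objects are ignored
        except OSError:
            continue                             # unreadable: treated as absent
    return "\n".join(lines) + "\n"


def directory_checksum(directory, sums):
    """Checksum of a directory = checksum of its .contents text."""
    text = contents_text(directory, sums)
    digest = hashlib.md5(text.encode("utf-8", "surrogateescape")).hexdigest()
    sums[directory] = digest                     # record for a later comparison
    return digest
```

Calling `directory_checksum(root, sums)` fills `sums` with one checksum per directory, which is roughly the information the text says the optional dumpfile carries.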

If the two file hierarchies to be compared are on 
hosts linked by a distributed file system (such as 
NFS), then the comparison tool can simply read 
the .contents files in corresponding directories of 
the two hierarchies and perform its comparisons 
based on what it reads. However, in some envi- 
ronments it may be impossible or impractical for 
the two hosts to be linked by a distributed file sys- 
tem protocol; instead, some other protocol (such 
as FTP) is used for short-lived and/or limited 
data shipping. To support such environments, the 
checksum tool optionally can also dump all direc- 
tory checksums — and only directory checksums 
— into a “dumpfile.” This file, which is expected 
to be relatively small compared to the entire hi- 
erarchy it summarizes, can then be sent from one 
host to the other, and the receiving host invokes 
the comparison tool, which operates as described 
below. 


2.2 Comparison Tool 


The comparison tool has a few different modes of 
operation. The command line specifies the mode 
as well as which hierarchy has been the most recent 
locus of updates. 

The checksum tool may have written .contents
files or a dumpfile, and it may or may not be possible
for the comparison tool to access both hierarchies
via normal UNIX calls such as open() and
fread() (i.e., both hierarchies are accessible as file
systems).

Regardless of which case is exercised, compar- 
ison of checksums proceeds top-down, with every 
match pruning the comparison. In the most ex- 
treme case, the checksums of the root directories 
of the two hierarchies match, and a single compar- 
ison concludes that the entire hierarchies are iden- 
tical. More typically, one or both sides will have 
entries that the other side does not have.³ In these
cases, and in cases where both sides have an en- 
try but with different checksums, the comparison 
tool emits UNIX commands that make updates, 
creates, and deletes in order to bring the outdated 
hierarchy up to date. 
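
The top-down comparison with pruning might look like the following Python sketch. It rests on several assumptions that are not from the paper. Each hierarchy is represented as a nested dictionary (a hypothetical format) in which every node has a "sum" and directories also have "entries". The sketch yields (action, path) pairs instead of UNIX commands. It ignores precious files, deletion markers, and partially cached hierarchies.

```python
def compare(newer, older, path=""):
    """Yield (action, path) pairs that bring `older` up to date with `newer`.

    Both arguments are directory nodes: {"sum": str, "entries": {name: node}}.
    File and symlink nodes carry only {"sum": str}.
    """
    if newer["sum"] == older["sum"]:
        return                      # equal checksums: prune the whole subtree
    new_entries = newer.get("entries", {})
    old_entries = older.get("entries", {})
    for name in sorted(new_entries.keys() | old_entries.keys()):
        child = f"{path}/{name}" if path else name
        if name not in old_entries:
            yield ("copy", child)           # exists only on the up-to-date side
        elif name not in new_entries:
            yield ("delete", child)         # exists only on the outdated side
        else:
            n, o = new_entries[name], old_entries[name]
            if n["sum"] == o["sum"]:
                continue                    # identical entry, nothing to do
            if "entries" in n and "entries" in o:
                yield from compare(n, o, child)   # descend into differing dirs
            else:
                yield ("copy", child)       # differing file: replace it
```

When the two root checksums match, the generator produces nothing after a single comparison, which is the extreme case described above.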

Large files with holes are handled specially, as 
explained in Section 2.3.3: a special copy program 
is used to transfer only sections of a large file. 


2.2.1 Precious Files 


The term “precious file” is taken from rdist [1]. 
At my site there is one significant exception to the 
single locus of updates behavior. We use hlfsd 
[26] to deliver Email into a user’s home directory.⁴
Regardless of whether the user is at home or work, 
Email goes into a file on the server at work. 

Special purpose protocols like POP3 [21] or 
IMAP4 [2] exist to remotely manipulate mail- 
boxes, so the comparison tool does not change the 
mailbox (and any other such precious files), leav- 
ing them for the special protocols. 


2.2.2 Fault Tolerance 


Assuming that the single-locus restriction is main- 
tained and that checksumming and comparison 
is done between locus changes, a failure dur- 
ing checksumming or comparison impacts perfor- 
mance but not correctness. 

There is no state kept outside the client and 
server file systems. The checksums are derived in- 
formation, and the copying done by the compari- 
son tool is idempotent assuming that non-precious 
files are not updated while the comparison tool 
runs. Therefore, if there is a failure during the
checksum/comparison phase, restarting from the
beginning will establish a correct state.

³ The .contents files are sorted in order to speed the
task of detecting entries that are in one .contents file but
not the other.

⁴ hlfsd controls the name space of the system mail spool
area (e.g., /usr/spool/mail) and converts the name of
each user’s mailbox into a symlink into the user’s home
directory. For example, /usr/spool/mail/user becomes a
symlink to $HOME/.mailspool/user.

The two real dangers are violating the single- 
locus-of-update restriction and making updates 
during the checksum/comparison phase. Section 
2.4 discusses the steps that can be taken against 
these dangers. 


2.3 Caching Tool 


The caching tool is an altered version of the latest 
release of the Amd automounter, a program written
by J-S. Pendry [17].⁵ Many of the alterations
were taken from the Autocacher, which was writ- 
ten by R. Minnich [12]. The Autocacher is itself an 
altered version of an old release of Amd. I ported 
the Autocacher features into the latest version of 
Amd, then added my features. 

This section explains the new features, assuming 
that the reader understands how Amd and Auto- 
cacher operate. Readers lacking this background 
can find it in Appendixes A and B. 

For file systems of type “cfs” (a name taken 
from Autocacher, denoting “caching file system”),
the caching tool implements all NFS operations. 
The primary data structures and the implemen- 
tations of read, readdir, readlink, lookup, and 
getattr are largely those ported from the Auto- 
cacher. The new operations are “mutating” oper- 
ations that make updates to the local cache; they 
can be classified into categories: delete (remove, 
rmdir, rename), create (mkdir, create, rename), 
and update (setattr and write). Note that the 
rename operation falls in two categories. 

It is undesirable to have every file system oper- 
ation leave the kernel to go through the caching 
tool; Section 3 shows the negative impact on per- 
formance. However, redirection of all operations 
through the caching tool is necessary, for these 
reasons: 


1. The caching tool needs to know about deletes. 


2. Because of the way deletes are represented 
in the file system, the caching tool must also 
know about creates. If a create operation re- 
creates a deleted name, the delete indicator 
must be deleted. Section 2.3.2 elaborates. 


3. The caching tool must be in control of read 
operations in order to intelligently transfer 
large files (see Section 2.3.3). 


⁵ J-S. Pendry wrote the original Amd. Erez Zadok
merged patches submitted by many persons into the
latest version, named “upl 102” [25].

These reasons justify the caching tool handling
some, but not all, NFS operations. It handles 
all operations because there is no way — without 
special kernel support such as stackable vnodes [5] 
— to selectively redirect a subset of operations 
through the caching tool while letting others go 
directly to the native file system. The caching tool 
must return a file handle on every lookup, and this 
handle is used for all operations thereafter. 

There are four interesting aspects to the imple- 
mentation: copy policy, deletes, large files, and 
cache management. 


2.3.1 Copy Policy 


The rule, taken from Autocacher, is that files are 
copied from the master lazily but directories are 
copied eagerly. This means that a lookup on an 
uncached directory results in the directory being 
copied and the returned handle pointing to the 
local copy, whereas a lookup on an uncached file 
does not result in a copy. The returned file handle 
is one fabricated by the caching tool and used for 
emulation. Files are cached lazily because it is 
common for a file to be looked up but not accessed. 
A lazy caching policy avoids transferring every file 
that is shown by an 1s command, for instance. 
The decision to treat files and directories differ- 
ently is founded on two assumptions. First, that 
the number of directories and the number of bytes 
occupied by directories are both small compared 
to the same numbers for files. Second, that the 
directory structure of a user’s home area changes 
slowly. The first assumption implies that it does 
not cost much to copy uncached directories to the 
client. The second assumption implies that, once 
cached, a directory is likely to stay at the client. 
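
The copy policy can be illustrated with a small Python sketch. The names are hypothetical and the model is deliberately simplified. A local directory stands in for the master, "copying" a directory is assumed to mean creating it in the cache, and a fabricated handle is just a tagged tuple rather than an NFS file handle.

```python
import os
import shutil


def lookup(master, cache, rel):
    """Return a handle for `rel`: directories eagerly, files lazily."""
    local = os.path.join(cache, rel)
    if os.path.lexists(local):
        return ("local", local)            # already cached
    if os.path.isdir(os.path.join(master, rel)):
        os.makedirs(local)                 # eager: copy the directory now
        return ("local", local)
    return ("fabricated", rel)             # lazy: no data is transferred yet


def read(master, cache, handle):
    """Read a file, fetching it into the cache on first access."""
    kind, value = handle
    if kind == "fabricated":
        local = os.path.join(cache, value)
        os.makedirs(os.path.dirname(local), exist_ok=True)
        shutil.copyfile(os.path.join(master, value), local)
        value = local
    with open(value, "rb") as f:
        return f.read()
```

Under this model, listing a directory costs only lookups, and a file's bytes cross the link the first time it is read.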


2.3.2 Tracking Deletes 


The delete operations require some care. The rea- 
son is that files and directories are demand-fetched 
from the master, meaning that the slave has only a 
subset of the master’s contents. So when the com- 
parison tool finds that a file or directory exists at 
the master but not at the slave, there is poten- 
tial ambiguity: has it been copied to the slave and 
deleted there, or never copied to the slave? 

I considered two approaches to resolving the ambiguity.
The first is to perform every delete at the
master as well as in the slave’s cache. This removes
the ambiguity: a file or directory present at the
master and absent at the slave has not been copied
to the slave. However, this approach increases the
number of interactions between slave and master,
which is potentially costly if the network connecting
the two is pay-per-use.

The second approach, which I used, is to main- 
tain a delete log. With this approach, the ambi- 
guity is removed in the opposite way: if a file or 
directory is present at the master and absent at 
the slave, it has been deleted at the slave if and 
only if the slave has a record of the deletion. 

I maintain what might be considered a distributed
log: deleted files and directories are renamed
in the cache. The renamed files and directories
are called “deletion markers.”⁶ As shown in
Figure 1, deleted files are truncated and renamed
with a special prefix. A deleted directory is similarly
renamed provided that it contains only deletion
markers. Before the directory is renamed, the deletion
markers in it are all deleted; Figure 2
illustrates.

The create operations make new directories and 
files as needed in the local disk cache, but they 
must be aware of deletion markers. When create 
creates a file, it checks for the associated deletion 
marker and deletes it if found, then creates the 
file. Checksumming will detect any conflict be- 
tween the new file in the cache and the old file 
back at the master. Other interactions between 
create and delete operations are more subtle. 

One subtle interaction is that deletion markers 
must be removed from a deleted directory so that 
later operations will proceed correctly. Suppose 
otherwise: that a deleted directory were simply renamed
as a marker, and the markers within it were
preserved. If the same name is later mkdir-ed and
rmdir-ed again then the .deleted.DIR.name will
exist and will not be empty at the time of the sec- 
ond rmdir. In this case the rename will fail be- 
cause the marker directory will not be empty. 

A second subtle interaction between creation 
and deletion is that in contrast to create, mkdir 
does not delete an existing deletion marker that 
corresponds to the name it is creating. Once a di- 
rectory has been deleted at the slave, its deletion 
marker remains, even if the directory is re-created. 
The deletion marker will remain until it is deleted 
by the comparison tool. If a directory’s deletion 
marker were deleted when it is re-created, then the 
sequence rmdir foo ; mkdir foo would result in 
a state indistinguishable from not having cached 
the files in foo, and the comparison tool would 
fail to delete the files on the master. Retaining 
both the new directory and the deletion marker 
indicates that it was deleted and recreated. Deleting
the re-created directory requires removing the
directory while leaving the deletion marker.


6 A similar mechanism is the “whiteout” used for 4.4BSD
union mounts [16].


Before deletion:

-rw-r--r-- 1 djd faculty 7183 Jun 14 22:26 foo

After deletion, foo is gone and a marker remains:

-rw-r--r-- 1 djd faculty 0 Jun 14 22:26 .DELETED.foo

Figure 1: File Deletion Using Markers


If before deletion:

drwxr-xr-x 2 djd faculty 512 Aug 17 1995 src

contains:

-rw-r--r-- 1 djd faculty 7183 Jun 14 22:26 foo
-rw-r--r-- 1 djd faculty 0 Jun 14 22:26 .DELETED.bar

Then rmdir will return an error because foo still exists.

However, if before deletion the contents are:

-rw-r--r-- 1 djd faculty 0 Jun 14 22:26 .DELETED.foo
-rw-r--r-- 1 djd faculty 0 Jun 14 22:26 .DELETED.bar

Then rmdir will succeed and result in the empty directory:

drwxr-xr-x 2 djd faculty 512 Jun 14 22:27 .DELETED.DIR.src

Figure 2: Directory Deletion Using Markers
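
The marker rules for create, delete, and rmdir can be illustrated with a short Python sketch. This is not the toolkit's code: the function names are hypothetical, and removing a stale, empty directory marker before the rename is an assumption made here so that the sketch behaves the same on every platform.

```python
import os
import tempfile

FILE_MARKER = ".DELETED."
DIR_MARKER = ".DELETED.DIR."


def marker_unlink(directory, name):
    """Delete a cached file and leave a zero-length deletion marker behind."""
    os.remove(os.path.join(directory, name))
    open(os.path.join(directory, FILE_MARKER + name), "w").close()


def marker_create(directory, name):
    """Create a file, first removing any deletion marker for the same name."""
    marker = os.path.join(directory, FILE_MARKER + name)
    if os.path.exists(marker):
        os.remove(marker)
    open(os.path.join(directory, name), "w").close()


def marker_rmdir(parent, name):
    """Remove a directory that holds nothing but deletion markers.

    The markers inside are removed and the emptied directory is renamed to a
    directory marker, so a later mkdir/rmdir of the same name still succeeds.
    """
    path = os.path.join(parent, name)
    entries = os.listdir(path)
    if any(not entry.startswith(FILE_MARKER) for entry in entries):
        raise OSError("directory not empty: " + path)
    for entry in entries:
        full = os.path.join(path, entry)
        if os.path.isdir(full):
            os.rmdir(full)   # an empty directory marker
        else:
            os.remove(full)  # a zero-length file marker
    marker = os.path.join(parent, DIR_MARKER + name)
    if os.path.isdir(marker):
        os.rmdir(marker)     # assumption: drop the stale, empty marker first
    os.rename(path, marker)
```

Note that a plain os.mkdir of the deleted name leaves the directory marker in place, which matches the mkdir behavior described in the text.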


2.3.3 Large Files 


As described to this point, emulation is accomplished
by copying the whole file from a master to
slave, then having the slave’s caching tool operate
on the local copy. If the file is large, and the
communication link is slow and/or costly, copy- 
ing the whole file can be expensive and cause sub- 
stantial delay. For example, copying a 1MB file 
over a telephone line with effective throughput 
of 2.7KB/sec—75% of 28.8Kb/sec, an optimistic 
scenario—takes well over 6 minutes. Accordingly, 
the toolkit handles large files specially, transfer- 
ring, checksumming, and comparing smaller units. 
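
The transfer-time estimate can be checked with a few lines of Python. The sketch assumes 1MB means 1024 × 1024 bytes and that the 28.8Kb/sec line rate is 28,800 bits per second; the function name is illustrative only.

```python
def transfer_seconds(size_bytes, link_bits_per_sec, efficiency):
    """Time to move size_bytes over a link that delivers only a fraction
    (efficiency) of its nominal bit rate as useful throughput."""
    effective_bytes_per_sec = link_bits_per_sec * efficiency / 8
    return size_bytes / effective_bytes_per_sec


seconds = transfer_seconds(1024 * 1024, 28_800, 0.75)
print(f"{seconds:.0f} seconds, or {seconds / 60:.1f} minutes")
```

With these assumptions the effective rate is 2700 bytes per second and the copy takes roughly 388 seconds.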

When the caching tool “copies” a large file, in
fact what it does is copy the file into a special
marker file in the same directory and make the
file’s name be a symlink pointing to the marker.⁷
The caching tool notes that a file is a large one
and handles reads and writes specially. Read operations
transfer only the requested byte range, assuming
it is not already cached. Write operations
also copy the requested byte range, if necessary,
then apply the update.


7 The name of the marker file is the file’s name preceded
by “.large-file.”.

The physical file in the cache is kept in a special 
format. The caching tool writes a header region 
at the beginning of the file that describes what 
ranges of the master file have been copied into the 
cache, and at what offset each range can be found 
in the cached file. (This special handling of large 
files is an example of the flexibility that can be 
achieved with user-level tools.) 
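
The text does not give the layout of the header region, so the following Python sketch only illustrates the idea of a range map: each entry records a range of the master file and the offset at which it is stored in the cache file. The class name, the tuple layout, and the rule that a request must fall inside a single cached range are all assumptions.

```python
class RangeMap:
    """Records which byte ranges of the master file are held in the cache file."""

    def __init__(self):
        # Each entry is (master_offset, length, cache_offset).
        self.ranges = []

    def add(self, master_offset, length, cache_offset):
        """Note that a range of the master file now lives at cache_offset."""
        self.ranges.append((master_offset, length, cache_offset))
        self.ranges.sort()

    def locate(self, offset, length):
        """Return the cache-file offset of master bytes [offset, offset + length),
        or None if no single cached range covers the whole request."""
        for start, size, cache_offset in self.ranges:
            if start <= offset and offset + length <= start + size:
                return cache_offset + (offset - start)
        return None
```

A read that locate() cannot satisfy would be fetched from the master and then recorded with add().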

When the checksum tool runs, it will always 
compute different checksums for the large file’s 
name at client and server, because the name will 
be a file at the server but a symlink at the client. 
When the comparison tool detects a checksum
mismatch, and detects that the client name is a 
symlink whereas the server name is a file, it then 
inspects the symlink value to determine if it is a 
marker for a large file. If it is, then a special copy 
utility reads the header region of the cache file and 
copies to the master file only the byte ranges that 
the cache file holds. 

The definition of a “large” file is a parameter 
that can be set either on Amd’s command line or 
via amq, the configuration utility for Amd. The
parameter is a number of seconds. The caching
tool measures the network throughput achieved by
recent copy operations. A file is “large” if copying
it would take more than the specified time.⁸
By setting the time parameter to zero, the user
can treat all files as large files. The reason not to
do so is that the checksum/compare phase will take
substantially longer than usual if every file is a 
checksum mismatch. 
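
The “large file” test can be sketched as follows. The function and parameter names are hypothetical, and 50KB is taken here to mean 50 × 1024 bytes; the 50KB fallback is the rule the paper gives for the first file, before any throughput has been measured.

```python
FIRST_FILE_LIMIT = 50 * 1024  # bytes; used before any throughput is known


def is_large(size_bytes, max_seconds, throughput_bytes_per_sec=None):
    """Decide whether a file should get the special large-file handling."""
    if throughput_bytes_per_sec is None:
        # No copy has been timed yet, so fall back to the fixed size limit.
        return size_bytes > FIRST_FILE_LIMIT
    # Otherwise a file is large if copying it would exceed the time limit.
    return size_bytes / throughput_bytes_per_sec > max_seconds
```

With max_seconds set to zero every non-empty file is treated as large, which is the case the text warns against.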


2.3.4 Cache Management 


Storing the cache and the “log” in the local file 
system contrasts with the approach taken by spe- 
cialized in-kernel file systems that implement the 
log and cache as separate data structures, both 
outside the file system name space. 

With the cache in the local file system, ordi- 
nary utilities and scripts can be used as cache 
management tools. This toolset can be superior 
in flexibility, scope, and ease of use compared to 
the mechanisms that file system implementors can
build into the client side of an in-kernel file system.
Therefore, it may be easier to implement
complex, user-specific and/or time-varying crite- 
ria for removing files or handling special cases. For 
example, cache files can be deleted based on any 
criteria any time the cache is consistent with the 
master. (I put cache cleanup in the logout script 
before the checksum phase.) 

Coda and AFS both maintain a log of all mutat- 
ing operations. The most important purpose for 
a log is to record deletes, but log information is 
also exploited for attempted automatic resolution 
of conflicts. In both systems, the log is a single 
data structure, and managing it is a source of com- 
plexity. If the log fills, processing stops; however, 
for typical workloads, information early in the log 
is superseded by later log entries. This motivates 
techniques such as log compression [7] and “trickle 
discharging” [14]. In my approach, the “log” and 
the cache share the same storage, so neither has 
a size limit that might not be ideally matched to 
the other’s size. 


2.4 Change of Locus 


Change of locus raises three issues: how to detect 
it, exactly when to execute checksum operations, 
and how to detect that the single-locus-of-updates 
assumption has been violated. 


8 The first file, accessed before there is any network
throughput information, is deemed large if it exceeds 50KB.


Change of locus is detected by running a script. 
The script runs when the user arrives at the new 
site or departs the old site. It can be executed by 
the user or can be invoked automatically as part 
of login/logout or as part of a terminal locking 
program, etc. The tradeoff between running the 
script at the new or old site is performance ver- 
sus safety. If the script is written assuming that 
the user is leaving his old site, the checksum op- 
erations can be overlapped with the user’s move- 
ment, but there must be some means to indicate 
that the script is finished (because the user must 
refrain from making updates at the new site un- 
til the compare/copy operation is finished). I have 
not implemented such a scheme, instead doing one 
checksum operation at logout and the other at lo- 
gin. 

The compare phase takes time proportional to 
the number of updates at the most recent locus. 
The checksum phase takes time proportional to 
the size of the file hierarchy, and the constant fac- 
tor is much larger than for comparison. Therefore, 
checksumming the master is by far the longest op- 
eration. Accordingly, I use login/logout as the in- 
dication of locus change, checksumming the mas- 
ter at logout and running the slave checksum and 
comparison at login. 

Invocation of checksumming at the other site is 
accomplished with rsh. 

Violation of the single-locus assumption can
be detected simply. In particular, when checksums
are computed at the new locus, existing
.contents files should not be overwritten.
Instead, they should be moved, say, to
“.previous-contents.” Then the comparison
tool not only compares corresponding .contents
files at the two locations, but also compares 
.previous-contents and .contents at the new 
locus. If they disagree, the single-locus assump- 
tion was violated; i.e., updates were made at both 
the new and old loci. 
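
A minimal sketch of this check, for a single directory, is given below. It makes several assumptions: SHA-256 stands in for the checksum actually used, the line format of .contents is invented for the example, and subdirectories are ignored, so the hierarchical aspect of the checksums is not modeled.

```python
import hashlib
import os
import tempfile

CONTENTS = ".contents"
PREVIOUS = ".previous-contents"


def checksum_directory(directory):
    """Write a fresh .contents file, keeping the old one as .previous-contents."""
    contents = os.path.join(directory, CONTENTS)
    if os.path.exists(contents):
        os.replace(contents, os.path.join(directory, PREVIOUS))
    lines = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name in (CONTENTS, PREVIOUS) or not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        lines.append(f"{digest}  {name}\n")
    with open(contents, "w") as f:
        f.writelines(lines)


def single_locus_violated(directory):
    """True if files changed here since the previous checksum run at this locus."""
    previous = os.path.join(directory, PREVIOUS)
    if not os.path.exists(previous):
        return False  # nothing to compare against yet
    with open(previous) as old, open(os.path.join(directory, CONTENTS)) as new:
        return old.read() != new.read()
```

A mismatch means that this site was updated while the locus of updates was supposed to be elsewhere.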

Note that consistency re-establishment need not 
occur only at change of locus. That is, a home user 
can invoke the tools in the middle of his session as 
if a change of locus were underway, provided that 
he refrains from updates for the necessary period 
of time. 


2.5 Communication Software 


Although dialup is a mundane function, the dialup 
package is the greatest obstacle to the toolkit’s 
portability and deployment. 

Dialup is a portability problem because what
is needed is on-demand dialup. That is, a telephone
connection must be brought up automatically
whenever there is outgoing traffic; dialing
upon login, say, and holding the connection defeats
the purpose of the toolkit. Many UNIX
TCP/IP stacks will simply return an error if there
is not already an active link device ready to accept
outgoing traffic. Providing a hook to allow
a user level script to set up a link on demand is
an operating system issue, thereby making it hard
to formulate a solution that is as portable across
UNIXes as Amd. My home client runs Linux.
Linux has the diald on-demand dialup package 
[24] that provides the right function. This is what 
I use. 

Dialup can be a deployment problem because 
machines on both sides of the communication link 
need to dial each other. Most installations have 
firewalls, many have restrictions on acceptable 
sources of dial-in and targets of dial-out, and many 
elect not to support SLIP or PPP access. These 
restrictions can be overcome, but may increase the 
amount of administrative changes that must pre- 
cede using the toolkit. 

An issue for the script that sets up and tears 
down the telephone connection is when to tear 
down. In a slightly different context this has been 
called the “Holding Time Problem” [22]. The is- 
sue is that maintaining an open connection costs 
money, but so does restarting a closed connec- 
tion, so shutting down immediately is not neces- 
sarily the best policy. Tariffs are known and, if 
future traffic patterns were known, a simple cal- 
culation would indicate how long an open connec- 
tion should be maintained. However, there is no 
realistic way to predict future traffic, especially 
from a single host. Accordingly, I have changed 
diald to include the heuristic that the telephone 
connection is shut down after an idle period whose 
connect charge equals the charge to make a new 
phone call. This policy limits the overall cost of 
a phone call to twice the charge to make the call 
(the real charge for making the call plus the charge 
for the idle time that causes the shutdown) plus 
the charge for the time used. Here time “used” 
means real use plus all idle periods shorter than 
necessary for a shutdown. 
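
The cost bound follows directly from the choice of idle period, as the Python sketch below shows. The tariff figures in the usage note are made up for illustration, and the function names are hypothetical; this is not the change made to diald.

```python
def idle_timeout(call_setup_charge, charge_per_second):
    """Idle period whose connect charge equals the charge to place a new call."""
    return call_setup_charge / charge_per_second


def call_cost(call_setup_charge, charge_per_second, used_seconds):
    """Total cost of one call: the setup charge, the time used, and the final
    idle period that triggers the shutdown."""
    timeout = idle_timeout(call_setup_charge, charge_per_second)
    return call_setup_charge + charge_per_second * (used_seconds + timeout)
```

For example, with a hypothetical setup charge of 0.10 and a rate of 0.001 per second, the idle timeout is 100 seconds, and a call with 600 seconds of use costs 0.80: twice the setup charge plus 0.60 for the time used.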

It is desirable to replace the dialup package 
with a more powerful “communication tool” that 
could direct traffic to my Internet connection op- 
portunistically, falling back to the phone line only 
when the Internet connection becomes unaccept- 
able. Such a tool is even more operating system 
specific than a dialup package. Work is underway
on a prototype.


TYPE OF LOOKUP             LATENCY

NFS, uncached
NFS, cached
Caching tool, uncached
Caching tool, cached
Cachefs, uncached
Cachefs, cached

Table 1: Cost of Redirection, Lookup


3 Evaluation 


The next few sections report performance mea- 
surements. Section 3.5 summarizes the adminis- 
trative changes that might be needed to deploy 
and use the toolkit on UNIX. 


3.1 Basic Cost of Redirection 


The first measurement determined the basic cost 
of redirection. I did a number of getattr oper- 
ations, measured the overall time to do them as 
well as the time spent in the caching tool, then 
subtracted. The result is the amount of time it 
takes to go into the kernel, have the kernel redi- 
rect the NFS operation to the caching tool, which 
acts as the NFS server, then return the result. 

On a Sun SparcStation-2 running Solaris 2.4, 
the result is 38.1ms with a standard deviation of 
3.2ms. 


3.2 Added Cost of Emulation and 
Caching 


The second experiment determined the time 
needed to do certain common NFS operations such 
as lookup and read. The caching tool was timed,
as were regular NFSv2 and Sun’s “cachefs” file system,
an in-kernel file system that caches on disk.

In the lookup test (Table 1) the parent direc- 
tory had already been looked up. The reported 
times are end-to-end measured from a user level 
program. An “uncached” lookup goes to a remote 
file server. In each test, the standard deviation 
was very small. 

In the read test (Table 2) the file had already 
been opened and one 4KB block was read. The 
reported times are end-to-end measured from a 
user level program. An “uncached” read goes to
a remote file server. A cached read is drawn from
memory in the case of NFS, and from disk in the
case of the caching tool and Sun’s Cachefs. In each
test, the standard deviation was very small.


TYPE OF READ               LATENCY

NFS, uncached
NFS, cached
Caching tool, uncached
Caching tool, cached
Cachefs, uncached
Cachefs, cached

Table 2: Cost of Redirection, Read

The very slow times for the caching tool show 
the effect of copying data between a process and 
the kernel. For the uncached read via the caching 
tool, the data is copied from the remote server 
to the caching tool, written from the caching tool 
into a local file, then copied back into the kernel 
as the result to return to the requester. 


3.3 Checksum Speed 


I ran the checksum test with my home direc- 
tory as input. My home is on a faster platform 
(SparcCenter-1000 running Solaris 2.5). I mea- 
sured 4.6ms per file with MD4 and 6.2ms per file 
with UNIX sum. For my home directory of 7502 
files comprising 899MB, it takes MD4 close to 35 
seconds to compute checksums. 
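
The total follows from the per-file figures, as this short Python check shows; the names are illustrative only.

```python
FILES = 7502
MS_PER_FILE = {"MD4": 4.6, "sum": 6.2}


def total_seconds(files, ms_per_file):
    """Total checksum time for a hierarchy, given a per-file cost in milliseconds."""
    return files * ms_per_file / 1000.0


for name, ms in MS_PER_FILE.items():
    print(f"{name}: {total_seconds(FILES, ms):.1f} seconds")
```

This gives about 34.5 seconds for MD4 and about 46.5 seconds for UNIX sum.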


3.4 Accuracy of Single-Locus-of- 
Update Assumption 


The effectiveness of the toolkit rests, in large part, 
upon the assumption that home directories typically
have a single locus of updates except for a few
easily identified exceptions such as Email delivery. 

Numerous studies have shown that there is very 
little write sharing in file systems; for example, 
[9] examines the number of times consecutive up- 
dates are made by different users. The data is 
suggestive, but “different users” is not the issue. 
It could be that the same user is making consec- 
utive updates, but from different sites. If one of 
these sites becomes disconnected yet still makes 
updates, there would be conflict even though the 
same user made all the updates. 

The toolkit requires the stronger condition that 
while updates are made at home they are not made 
at the office server, and vice versa, for the user's
entire home directory. I am aware of no study
that measures usage patterns with sufficient pre- 
cision to decide how dangerous the single-locus-of- 
update assumption is. One complicating factor is 
that it is acceptable for updates to come from any 
number of office-side client machines during the 
time when the locus of updates is at the office. 
To shed some light on the matter, I changed the 
office-side mount daemon so that it timestamped 
its records of remote mount/unmount operations. 
Using the time information one can tell at all times 
how many clients had a file system mounted. I 
have seen no unexplained mounts during the time 
when home is the locus of updates. However, as 
discussed earlier, it is possible to detect when a file
has been changed at both sites by performing one
extra checksum operation and one extra compare. 


3.5 Summary of Administrative Ac- 
tions 


The following system administration steps must 
be taken to set up and use the toolkit for partially
connected access to one’s home directory. Depend- 
ing on an installation’s current policies, these steps 
may or may not represent changes. 


1. At the server: make the user’s home direc- 
tory a separate file system, and export this 
file system so that the user’s home machine 
can mount it. 


2. At the client: install the caching tool as the 
automounter; the caching tool can replace 
Amd version upl102.


3. At the client: add the “cfs” entry to the Amd 
maps, and allocate an area of the local file 
system to serve as cache and log. 


4. At the client: install a dialup-on-demand
package.


5. At both sites: allow IP dial-in and/or dial- 
out to the other site. Allow execution of rsh 
commands sent from one site to the other. 


6. The user: write scripts to invoke check- 
sum, comparison, and cache cleanup. Exe- 
cute these scripts from login/logout or other 
scripts that execute manually or automatically
when the locus of updates switches.


4 Comparison to Related Work


There are two principal categories of related work. 
One category contains the many commercial products
for “file synchronization.” Well known products
include Symantec’s pcAnywhere, Traveling
Software’s Laplink, and Microsoft’s Traveler’s
Briefcase. The other category is the published
work on “normal” distributed file systems that
have been extended to accommodate users who
suffer various levels of disconnection. In this category
are Coda [9, 10, 11, 13, 14, 15, 23], Ficus
[3, 4, 18], and the Michigan alterations to AFS
[6, 7, 8].

My work sits in between. The main differ- 
ence with the commercial tools is that operation 
is more automatic, no special software is required 
on the master server, or both. Also, I know of 
no tools, commercial or otherwise, that use hier- 
archical checksums. The main difference with the 
research file systems—besides scope, a special pur- 
pose toolkit versus a full fledged distributed file 
system—is point of view. Perhaps because of the 
history of the general purpose file system projects, 
the point developed in many of the papers cited 
above is that general purpose techniques can be 
extended and adapted to roughly accommodate 
the needs of (completely, partially, and mostly) 
disconnected users. This paper raises and partly 
answers an opposing question: can a large frac- 
tion of the usefulness of distributed file systems for 
partially disconnected users be delivered through 
a portable toolkit that has only a small fraction 
of the complexity? The papers cited above ex- 
plore whether extension of existing distributed file 
system techniques is sufficient to accommodate 
disconnected users; this paper raises the issue of 
whether such techniques are necessary.⁹

The main issue is the workload. If the work- 
load upholds the single-locus-of-update assump- 
tion, then the consistency problem becomes triv- 
ial and the need to build support for consistency 
protocols, logging, and conflict resolution into the 
basic file system becomes questionable. 

9 There is middle ground. The “fetch only mode” in AFS
[8] is a step in the direction of my work. In fetch-only mode
fetches are made but the log is not replayed until a later
time. Since log replay is what detects (and maybe resolves)
conflicts, not doing it results in operation similar to mine
of fetching on demand but not checking for conflicts until
a locus change.


It is widely accepted that most areas of a file
namespace can be accurately characterized as either
personal read-write areas or read-only shared
areas. Data gathered from the file system projects 
suggests that personal areas typically do have only 
a single locus of updates, and that conflicts aris- 
ing between replicas can be handled simply, as the 
comparison tool does. For example, [4] states: 


Another factor contributing to the 
rarity of conflicts is the effect 
of a human write token ...

Because updates to personal data 
come primarily from a single user, 
that person serves effectively as 
a ‘‘write token’’ for those files. 
By arranging a pattern of 
reconciliation corresponding to the 
presence of the user, the most 
recent data is almost always 
present when updates are made. 


The “arranging” referred to is the tweaking of sys- 
tem parameters that govern when the reconcilia- 
tion function built into the file system runs. One 
might ask whether—given a fast user level recon- 
ciliation procedure—reconciliation should be done 
that way: by the file system, with involvement of 
the system administrator. Further, [15] states: 


In the vast majority of cases, 
resolution merely involves 
overwriting a stale replica 
with the most current one. 


A third category of relevant work is techniques 
to automatically resolve conflicts [10, 18, 11]. This 
work is valuable in its original context and is com- 
plementary to my work: a slight generalization 
of the comparison tool would have it invoke such 
tools whenever a conflict is found. Operating at 
the user level allows such tools to become quite 
elaborate, if necessary. 

A final point regarding similar tools is that the 
function provided by the toolkit cannot be dupli- 
cated simply by rdist-ing from one site to an- 
other. The difference is that the caching tool 
copies on demand from the master and tracks 
deletes. Copying on demand is more bandwidth- 
efficient, and the tracking of deletes allows the 
comparison tool to propagate deletes from one site 
to another. 


5 Summary 


I have described a small set of tools that can be 
used, in certain circumstances, to maintain consistency
between a master set of files kept on a
server and copies kept on a partially disconnected
client.¹⁰ Because only limited server-side system
administration changes are required, users can de- 
cide individually whether to use the toolkit. 

The work makes two main contributions. First, 
the software enables partially connected users to 
access their NFS home directories from stock 
UNIX clients. Second, the toolkit design provides 
a stark contrast to earlier work on custom file 
systems for partially connected operation. The 
toolkit is not as fast, not as general, and may de- 
pend on outside forces to maintain the validity of 
key assumptions; however, the implementation at 
user level makes the work more portable and pro- 
vides many degrees of freedom in selecting how the 
toolkit operates. 

A partial list of “such degrees of freedom” is: 
when to perform checksums, whether to test for vi- 
olation of the single-locus-of-updates assumption, 
how large is a “large file,” user-programmable 
cache cleanup, ability to invoke programs that au- 
tomatically resolve conflicting updates, and how 
to manage the dialup connection. 


5.1 Future Work 


One obvious area for future work is difference 
copying. A number of commercial packages ex- 
ist for Windows and provide function similar to 
the toolkit. Most of these packages provide for 
copying only differences in changed files. This is a 
complication, as more state must be kept on both 
sides. My personal experience so far has been with 
small text files, so the compare/copy phase runs 
fast enough. But in different settings differencing 
could be crucial. 


Availability 
The toolkit is available at 
http://www.mcl.cs.columbia.edu/src/cfs.


10 The “disconnection model” is that connection by telephone
is expensive and to be used frugally, but is always
available on demand. There are other disconnection
models.


A Amd Operation


Amd has many features. The most obvious and 
commonly used is to mount remote NFS file sys- 
tems on demand then later, after a period of dis- 
use, to automatically unmount them. 

The following is the sequence of relevant events 
showing how Amd demand-mounts the remote 
file system serv:/u/user the first time that any 
path on that file system (in this example, the file 
/u/user/dir/file) is accessed. 


1. Before Amd is running, directory /u does not 
exist on the local disk. 


2. Based on startup instructions, Amd creates
/u then mounts itself as the NFS server servicing
mount point /u.

Amd is also told at startup that whenever
it must mount a remote file system
it should do so within the directory /amd.
(In this example the convention is that
if server something1 exports file system
/something2, then it will be mounted as
/amd/something1/something2.)


3. When someone accesses /u/user/dir/file,
Amd is contacted during the name resolution
because it is the process serving
mount point /u. Since this is the
first access to /u/user, Amd creates
the local directories /amd, /amd/serv,
/amd/serv/u, /amd/serv/u/user, and
/amd/serv/u/user/dir, mounts the remote
NFS file system serv:/u/user onto
/amd/serv/u/user and thereafter emulates
a symlink named /u/user whose value is
/amd/serv/u/user. Since /u/user is an
emulated path, there is no local disk data
structure for it.


4. Access to /u/user/dir/file is then done
normally by the kernel’s name resolution routine.

First, the kernel calls getattr(u) and
discovers that /u is a mount point; Amd
is the server mounted at that point. Second,
the name resolution routine calls
getattr(user) and is told by Amd that
user is a symlink. Next, the name resolution
routine calls readlink(user) and
is told by Amd that the symlink’s value is
/amd/serv/u/user. At this point the original
path /u/user/dir/file has been transformed
to /amd/serv/u/user/dir/file.
This path contains no symlinks or mount
points serviced by Amd, but rather is an
“ordinary” path leading to a file on a remote
NFS file system; i.e., /amd/serv/u/user is
a directory on the local disk, with remote
NFS file system serv:/u/user mounted on
it. The name resolution algorithm will cross
the mount point and return an NFS handle
for the remote file.


Figure 3 indicates the result of each lookup operation.


What happens

return vnode for local /

cross mount point into phony "NFS" f/s; return
vnode that contains a made-up file handle
meaningful only to Amd and which denotes the
"root" of the set of links it is emulating

return vnode for symlink whose value is
"/amd/serv/u/user"

return vnode for local /

return vnode for local /amd

return vnode for local /amd/serv

return vnode for local /amd/serv/u

cross mount point into real NFS f/s; return
vnode for root of remote exported f/s

remote f/s returns handle for directory

remote f/s returns handle for file

Figure 3: Example Name Resolution with Amd
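
The path transformation performed during name resolution can be mimicked with a small Python sketch. It models only the symlink substitution, not mount points or file handles, and the function name and dictionary argument are hypothetical.

```python
def resolve(path, emulated_links):
    """Walk a path one component at a time, restarting at a symlink's value.

    emulated_links maps an absolute path to the value of the symlink that is
    emulated at that path (the role Amd plays for /u/user in the example).
    """
    components = [c for c in path.split("/") if c]
    resolved = ""
    while components:
        resolved += "/" + components.pop(0)
        target = emulated_links.get(resolved)
        if target is not None:
            # readlink: substitute the link's value and restart from the root.
            components = [c for c in target.split("/") if c] + components
            resolved = ""
    return resolved or "/"


print(resolve("/u/user/dir/file", {"/u/user": "/amd/serv/u/user"}))
```

For the example in the text this prints /amd/serv/u/user/dir/file.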

Because Amd is concerned only with name res- 
olution, it implements only the getattr, lookup, 
readlink, and readdir NFS operations.¹¹


11 Amd also implements the unlink, rmdir, rename, and
statfs operations, but these implementations are very limited,
serving only to help Amd mask its presence in
the name resolution process.


B Autocacher Operation


The Autocacher is an old version of Amd plus
some changes to implement caching of remote
read-only NFS file systems onto the local disk.
Autocacher emulates symlinks, like Amd, and it
partially emulates files. Specifically, it implements
the read operation.

Using the conventions of the example above,
when resolving the path /u/user/dir/file
Amd’s only actions are to transform /u/user
into a symlink with value /amd/serv/u/user and
to mount serv:/u/user on the local directory
/amd/serv/u/user. Access to dir/file is via
the remote NFS server. Amd neither emulates the
dir/file subpath nor makes local disk copies of
dir and file; in contrast, Autocacher does both.
Autocacher will create local disk copies of the
directory dir and the file dir/file.

For efficiency, Autocacher will emulate reads
against the file. It is common for a file to be looked
up but not read; e.g., an ls operation. Therefore,
Autocacher does not copy a file when it is first
looked up, but only when it is first read. Otherwise,
executing ls on a directory would cause
the entire contents of the directory to be cached.
When an uncached file is first read, the Autocacher
copies the file and services all the read operations
from the process that performed the first lookup.
The second and later lookups will be directed to
the local copy.

More precisely, when Autocacher receives an
NFS lookup operation, it reacts in one of three
ways:


1. If there is a local copy of the file, Autocacher
returns the file handle for the local file.


2. If there is no local copy but there is room on
the local disk for the file, Autocacher returns
a manufactured file handle that is meaningful
only to Autocacher. A later read request for
this file handle will cause the Autocacher to
copy the file to the local disk and to implement
read requests by reading from the local
copy.


3. If there is no local copy and there is no room
on the local disk for the file, Autocacher also
returns a manufactured file handle, but later
NFS read requests for this file handle will
cause the Autocacher to emulate a symlink
pointing to the remote file.


Just as Amd automatically unmounts recently
unused file systems, Autocacher automatically
deletes local cached copies of recently unused
remote files.
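
The three reactions to an NFS lookup described in this appendix can be summarized in a short Python sketch. The enumeration, the function signature, and the simple comparison of file size against free space are assumptions for illustration; the text does not say how Autocacher decides that there is room on the local disk.

```python
from enum import Enum


class Handle(Enum):
    LOCAL = "file handle for the local copy"
    COPY_ON_READ = "manufactured handle; copy the file on first read"
    SYMLINK_ON_READ = "manufactured handle; emulate a symlink to the remote file"


def lookup(has_local_copy, file_size, free_bytes):
    """Choose which kind of handle a lookup returns."""
    if has_local_copy:
        return Handle.LOCAL
    if file_size <= free_bytes:
        # Room on the local disk: defer the copy until the first read.
        return Handle.COPY_ON_READ
    # No room: later reads are redirected to the remote file.
    return Handle.SYMLINK_ON_READ
```

Deferring the copy to the first read is what keeps an ls of a directory from caching every file in it.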